java - A regular expression for harvesting include and require directives
I am trying to harvest all inclusion directives from a PHP file using a regular expression (in Java).
The expression should pick up only those that have file names expressed as unconcatenated string literals. Ones with constants or variables are not necessary.
Detection should work for both single and double quotes, include-s and require-s, plus the additional trickery with _once and, last but not least, both keyword- and function-style invocations.
A rough input sample:
<?php

require('a.php');
require 'b.php';
require("c.php");
require "d.php";

include('e.php');
include 'f.php';
include("g.php");
include "h.php";

require_once('i.php');
require_once 'j.php';
require_once("k.php");
require_once "l.php";

include_once('m.php');
include_once 'n.php';
include_once("o.php");
include_once "p.php";

?>
And the output:
["a.php","b.php","c.php","d.php","e.php","f.php","g.php","h.php","i.php","j.php","k.php","l.php","m.php","n.php","o.php","p.php"]
Any ideas?
To do this accurately, you really need to parse the PHP source code. This is because the text sequence require('a.php'); can appear in places where it is not really an include at all, such as in comments, strings and HTML markup. For example, the following are not real PHP includes, but will be matched by the regex:
<?php // Examples where the regex solution gets false positives:
/* PHP multi-line comment with: require('a.php'); */
// PHP single-line comment with: require('a.php');
$str = "double quoted string with: require('a.php');";
$str = 'single quoted string with: require("a.php");';
?>
<p>HTML paragraph with: require('a.php');</p>
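To make the false-positive problem concrete, here is a minimal Java sketch. The class name `FalsePositiveDemo`, the method `countMatches` and the deliberately simplified pattern (function-style `require` with a single-quoted name only) are assumptions made for illustration. None of the four occurrences in the sample is a real include, yet each one is matched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FalsePositiveDemo {
    // Simplified, hypothetical pattern: function-style require with a
    // single-quoted filename only. It knows nothing about PHP syntax.
    static final Pattern SIMPLE = Pattern.compile("\\brequire\\('([^']+)'\\);");

    // Count how many times the pattern matches anywhere in the text.
    static int countMatches(String text) {
        Matcher m = SIMPLE.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // None of these four occurrences is a real include statement.
        String php =
              "<?php\n"
            + "/* require('a.php'); */\n"
            + "// require('a.php');\n"
            + "$str = \"require('a.php');\";\n"
            + "?>\n"
            + "<p>require('a.php');</p>\n";
        System.out.println(countMatches(php)); // prints 4
    }
}
```

Telling these cases apart would need a tokenizer or parser that tracks comments, string literals and the `<?php ... ?>` boundaries; a regex alone cannot see that context.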
That said, if you are happy with getting a few false positives, the following single regex solution will do a pretty good job of scraping all the filenames from all the PHP include variations:
// Get all filenames from PHP include variations and return them in an array.
function getIncludes($text) {
    $count = preg_match_all('/
        # Match PHP include variations with a single string literal filename.
        \b              # Anchor to a word boundary.
        (?:             # Group for include variation alternatives.
          include       # Either "include"
        | require       # or "require".
        )               # End group of include variation alternatives.
        (?:_once)?      # Either one may be the "_once" variation.
        \s*             # Optional whitespace.
        (               # $1: Optional opening parenthesis.
          \(            # Literal open parenthesis,
          \s*           # followed by optional whitespace.
        )?              # End $1: Optional opening parenthesis.
        (?|             # "Branch reset" group of filename alternatives.
          \'([^\']+)\'  # Either $2{1}: single quoted filename,
        | "([^"]+)"     # or $2{2}: double quoted filename.
        )               # End branch reset group of filename alternatives.
        (?(1)           # If there was an opening parenthesis,
          \s*           # allow optional whitespace
          \)            # followed by a closing parenthesis.
        )               # End group $1 if conditional.
        \s*             # End statement with optional whitespace
        ;               # followed by a semi-colon.
        /ix', $text, $matches);
    if ($count > 0) {
        $filenames = $matches[2];
    } else {
        $filenames = array();
    }
    return $filenames;
}
Addendum 2011-07-24: It turns out the OP wants a solution in Java, not PHP. Here is a tested Java program which is nearly identical. Note that I am not a Java expert and don't know how to dynamically size an array. Thus, the solution below (crudely) sets up a fixed size array (100) to hold the array of filenames.
import java.util.regex.*;

public class Test {
    // Set maximum size of the array of filenames.
    public static final int MAX_NAMES = 100;

    // Get all filenames from PHP include variations and return them in an array.
    public static String[] getIncludes(String text) {
        int count = 0; // Count of filenames.
        String filenames[] = new String[MAX_NAMES];
        String filename;
        Pattern p = Pattern.compile(
            "# Match include variations with a single string filename. \n" +
            "\\b             # Anchor to a word boundary.               \n" +
            "(?:             # Group for include variation alternatives.\n" +
            "  include       # Either 'include',                        \n" +
            "| require       # or 'require'.                            \n" +
            ")               # End group of include variation alts.     \n" +
            "(?:_once)?      # Either one may have the '_once' suffix.  \n" +
            "\\s*            # Optional whitespace.                     \n" +
            "(?:             # Group for optional opening paren.        \n" +
            "  \\(           # Literal open parenthesis,                \n" +
            "  \\s*          # followed by optional whitespace.         \n" +
            ")?              # Opening parenthesis is optional.         \n" +
            "(?:             # Group for filename alternatives.         \n" +
            "  '([^']+)'     # $1: Either a single quoted filename,     \n" +
            "| \"([^\"]+)\"  # or $2: a double quoted filename.         \n" +
            ")               # End group of filename alternatives.      \n" +
            "(?:             # Group for optional closing paren.        \n" +
            "  \\s*          # Optional whitespace,                     \n" +
            "  \\)           # followed by a closing parenthesis.       \n" +
            ")?              # Closing parenthesis is optional.         \n" +
            "\\s*            # End statement with optional whitespace,  \n" +
            ";               # followed by a semi-colon.                ",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.COMMENTS);
        Matcher m = p.matcher(text);
        while (m.find() && count < MAX_NAMES) {
            // The filename is in either $1 or $2.
            if (m.group(1) != null) filename = m.group(1);
            else                    filename = m.group(2);
            // Add this filename to the array of filenames.
            filenames[count++] = filename;
        }
        return filenames;
    }

    public static void main(String[] args) {
        // Test string full of various PHP include statements.
        String text = "<?php\n"+
            "\n"+
            "require('a.php');\n"+
            "require 'b.php';\n"+
            "require(\"c.php\");\n"+
            "require \"d.php\";\n"+
            "\n"+
            "include('e.php');\n"+
            "include 'f.php';\n"+
            "include(\"g.php\");\n"+
            "include \"h.php\";\n"+
            "\n"+
            "require_once('i.php');\n"+
            "require_once 'j.php';\n"+
            "require_once(\"k.php\");\n"+
            "require_once \"l.php\";\n"+
            "\n"+
            "include_once('m.php');\n"+
            "include_once 'n.php';\n"+
            "include_once(\"o.php\");\n"+
            "include_once \"p.php\";\n"+
            "\n"+
            "?>\n";
        String filenames[] = getIncludes(text);
        for (int i = 0; i < MAX_NAMES && filenames[i] != null; i++) {
            System.out.print(filenames[i] + "\n");
        }
    }
}
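On the fixed-size array: one possible way around it is to collect the matches in a `java.util.ArrayList`, which grows as needed. The sketch below is a variation on the program above, not the answer's tested code. The class name `IncludeHarvester` is an assumption, and the pattern is a compact form of the same regex with the inline comments removed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IncludeHarvester {
    // Compact form of the commented pattern: keyword, optional "_once",
    // optional "(", a single or double quoted filename, optional ")", then ";".
    private static final Pattern INCLUDE = Pattern.compile(
          "\\b(?:include|require)(?:_once)?\\s*(?:\\(\\s*)?"
        + "(?:'([^']+)'|\"([^\"]+)\")"
        + "(?:\\s*\\))?\\s*;",
        Pattern.CASE_INSENSITIVE);

    // Return every literal filename found, in order of appearance.
    static List<String> getIncludes(String text) {
        List<String> filenames = new ArrayList<>();
        Matcher m = INCLUDE.matcher(text);
        while (m.find()) {
            // Group 1 holds a single quoted name, group 2 a double quoted one.
            filenames.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return filenames;
    }

    public static void main(String[] args) {
        String text = "<?php\nrequire('a.php');\ninclude_once \"p.php\";\n?>\n";
        System.out.println(getIncludes(text)); // prints [a.php, p.php]
    }
}
```

Returning a `List<String>` also removes the need for the caller to stop at the first `null` entry, and there is no longer a cap of 100 filenames.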