java - A regular expression for harvesting include and require directives


I am trying to harvest the inclusion directives of a PHP file using a regular expression (in Java).

The expression should pick up only those that have file names expressed as unconcatenated string literals. Ones with constants or variables are not necessary.

Detection should work for both single and double quotes, include-s and require-s, plus the additional trickery with _once and, last but not least, both keyword- and function-style invocations.

A rough input sample:

<?php

require('a.php');
require 'b.php';
require("c.php");
require "d.php";

include('e.php');
include 'f.php';
include("g.php");
include "h.php";

require_once('i.php');
require_once 'j.php';
require_once("k.php");
require_once "l.php";

include_once('m.php');
include_once 'n.php';
include_once("o.php");
include_once "p.php";

?>

And output:

["a.php","b.php","c.php","d.php","e.php","f.php","g.php","h.php","i.php","j.php","k.php","l.php","m.php","n.php","o.php","p.php"]

Any ideas?

To do this accurately, you really need to parse the PHP source code. This is because the text sequence require('a.php'); can appear in places where it is not really an include at all, such as in comments, strings and HTML markup. For example, the following are not real PHP includes, but they will be matched by the regex:

<?php // Examples where the regex solution gets false positives:
    /* PHP multi-line comment with: require('a.php'); */
    // PHP single-line comment with: require('a.php');
    $str = "double quoted string with: require('a.php');";
    $str = 'single quoted string with: require("a.php");';
?>
    <p>HTML paragraph with: require('a.php');</p>

That said, if you are happy with getting a few false positives, the following single regex solution will do a pretty good job of scraping the filenames from all the PHP include variations:

// Get filenames from PHP include variations and return them in an array.
function getIncludes($text) {
    $count = preg_match_all('/
        # Match PHP include variations with single string literal filename.
        \b              # Anchor to word boundary.
        (?:             # Group for include variation alternatives.
          include       # Either "include"
        | require       # or "require"
        )               # End group of include variation alternatives.
        (?:_once)?      # Either one may be the "once" variation.
        \s*             # Optional whitespace.
        (               # $1: Optional opening parentheses.
          \(            # Literal open parentheses,
          \s*           # followed by optional whitespace.
        )?              # End $1: Optional opening parentheses.
        (?|             # "Branch reset" group of filename alts.
          \'([^\']+)\'  # Either $2{1}: Single quoted filename,
        | "([^"]+)"     # or $2{2}: Double quoted filename.
        )               # End branch reset group of filename alts.
        (?(1)           # If there were opening parentheses,
          \s*           # then allow optional whitespace
          \)            # followed by the closing parentheses.
        )               # End group $1 if conditional.
        \s*             # End statement with optional whitespace
        ;               # followed by semi-colon.
        /ix', $text, $matches);
    if ($count > 0) {
        $filenames = $matches[2];
    } else {
        $filenames = array();
    }
    return $filenames;
}
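The comment false positives mentioned earlier can be reduced by blanking out comments before the include regex runs. The following is only a rough sketch in Java, not part of the tested solution: `stripComments` is a hypothetical helper name, and it assumes that `/*` and `//` never appear inside PHP string literals or HTML (a URL such as "http://..." in a string would be damaged), so it trades one kind of error for another.

```java
import java.util.regex.Pattern;

public class StripComments {
    // Hypothetical helper pattern: matches /* ... */ blocks and // line comments.
    // Caveat: it knows nothing about string literals or the closing ?> tag.
    private static final Pattern COMMENTS = Pattern.compile(
        "/\\*.*?\\*/"      // multi-line comment, non-greedy
        + "|//[^\\n]*",    // single-line comment up to end of line
        Pattern.DOTALL);   // let "." span newlines inside /* ... */

    // Replace every comment with a single space.
    public static String stripComments(String text) {
        return COMMENTS.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        String text = "<?php\n"
            + "/* require('a.php'); */\n"
            + "// require('b.php');\n"
            + "require('c.php');\n"
            + "?>\n";
        // Only the real include of c.php should survive.
        System.out.print(stripComments(text));
    }
}
```

The stripped text can then be handed to whichever include regex is in use. Includes that sit inside string literals or HTML markup are still matched, so a real PHP tokenizer remains the accurate option.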

Additional 2011-07-24: It turns out the OP wants a solution in Java, not PHP. Here is a tested Java program that is nearly identical. Note that I am not a Java expert and don't know how to dynamically size an array. Thus, the solution below (crudely) sets a fixed size array (100) to hold the array of filenames.

import java.util.regex.*;

public class Test {
    // Set maximum size of array of filenames.
    public static final int MAX_NAMES = 100;

    // Get filenames from PHP include variations and return them in an array.
    public static String[] getIncludes(String text)
    {
        int count = 0;                          // Count of filenames.
        String filenames[] = new String[MAX_NAMES];
        String filename;
        Pattern p = Pattern.compile(
            "# Match include variations with single string filename. \n" +
            "\\b             # Anchor to word boundary.              \n" +
            "(?:             # Group include variation alternatives. \n" +
            "  include       # Either 'include',                     \n" +
            "| require       # or 'require'.                         \n" +
            ")               # End group of include variation alts.  \n" +
            "(?:_once)?      # Either one may have '_once' suffix.   \n" +
            "\\s*            # Optional whitespace.                  \n" +
            "(?:             # Group for optional opening paren.     \n" +
            "  \\(           # Literal open parentheses,             \n" +
            "  \\s*          # followed by optional whitespace.      \n" +
            ")?              # Opening parentheses are optional.     \n" +
            "(?:             # Group for filename alternatives.      \n" +
            "  '([^']+)'     # $1: Either a single quoted filename,  \n" +
            "| \"([^\"]+)\"  # or $2: a double quoted filename.      \n" +
            ")               # End group of filename alternatives.   \n" +
            "(?:             # Group for optional closing paren.     \n" +
            "  \\s*          # Optional whitespace,                  \n" +
            "  \\)           # followed by the closing parentheses.  \n" +
            ")?              # Closing parentheses are optional.     \n" +
            "\\s*            # End statement with optional ws,       \n" +
            ";               # followed by semi-colon.               ",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.COMMENTS);
        Matcher m = p.matcher(text);
        while (m.find() && count < MAX_NAMES) {
            // The filename is in either $1 or $2.
            if (m.group(1) != null) filename = m.group(1);
            else                    filename = m.group(2);
            // Add this filename to the array of filenames.
            filenames[count++] = filename;
        }
        return filenames;
    }

    public static void main(String[] args)
    {
        // Test string full of various PHP include statements.
        String text = "<?php\n"+
            "\n"+
            "require('a.php');\n"+
            "require 'b.php';\n"+
            "require(\"c.php\");\n"+
            "require \"d.php\";\n"+
            "\n"+
            "include('e.php');\n"+
            "include 'f.php';\n"+
            "include(\"g.php\");\n"+
            "include \"h.php\";\n"+
            "\n"+
            "require_once('i.php');\n"+
            "require_once 'j.php';\n"+
            "require_once(\"k.php\");\n"+
            "require_once \"l.php\";\n"+
            "\n"+
            "include_once('m.php');\n"+
            "include_once 'n.php';\n"+
            "include_once(\"o.php\");\n"+
            "include_once \"p.php\";\n"+
            "\n"+
            "?>\n";
        String filenames[] = getIncludes(text);
        // Unused slots stay null, so stop at the first null entry.
        for (int i = 0; i < MAX_NAMES && filenames[i] != null; i++) {
            System.out.print(filenames[i] +"\n");
        }
    }
}
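The fixed-size array above can be avoided with a growable list. Here is a minimal sketch, assuming a `java.util.ArrayList` is acceptable as the return type. The class name `IncludeList` is made up for illustration, and the pattern is a compact restatement of the commented one above rather than a separately tested regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IncludeList {
    // Compact form of the include/require pattern (no comments mode).
    private static final Pattern INCLUDE = Pattern.compile(
        "\\b(?:include|require)(?:_once)?\\s*(?:\\(\\s*)?"  // keyword, optional "("
        + "(?:'([^']+)'|\"([^\"]+)\")"                      // $1 or $2: filename
        + "(?:\\s*\\))?\\s*;",                              // optional ")", then ";"
        Pattern.CASE_INSENSITIVE);

    // Collect filenames into a list that grows as needed.
    public static List<String> getIncludes(String text) {
        List<String> filenames = new ArrayList<>();
        Matcher m = INCLUDE.matcher(text);
        while (m.find()) {
            // Exactly one of the two groups is non-null for each match.
            filenames.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return filenames;
    }

    public static void main(String[] args) {
        String text = "<?php require('a.php'); include_once \"b.php\"; ?>";
        System.out.println(getIncludes(text));
    }
}
```

With a list there is no limit of 100 names and no need to scan for the first null entry when printing. Like the array version, this still accepts mismatched parentheses (for example `require('a.php';`), because Java regexes have no conditional groups.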
