in reply to Comparing pattern

Have you tried it like this?
my $patterns = "/path/to/file.txt";
my $arg1 = shift;
open( PATTERNS, "<", $patterns ) or die "$patterns: $!\n";
my @list_patterns = <PATTERNS>;
close PATTERNS;
chomp @list_patterns;
my $list_regex = join '|', @list_patterns;
open( FILE, "<", $arg1 ) or die "$arg1: $!\n";
while (<FILE>) {
    if ( /($list_regex)/ ) {
        print "\n$arg1\n$1\n";
    }
}
If your list of patterns does not include anything that tries to match across a line-break (i.e.: "...\n..."), then you don't need to slurp your whole "arg1" file content into memory at one time. Depending on the files that you are searching through, that can save time by avoiding memory swaps, and depending on what sort of patterns you are looking for, applying the regex to a small string (one line at a time) could be a lot faster than applying it to a whole file.

If your patterns do involve matching across line breaks, loading them all into a single regex (joining them together with "|") will probably speed things up anyway, because you only do one regex match against the whole string.

Re^2: Comparing pattern
by mrc (Sexton) on Sep 19, 2009 at 07:57 UTC
    Thanks a lot, guys! Both of your suggestions helped me solve this problem. Based on bv's suggestion regarding qr//, I also found a great post: http://www.perlmonks.org/?node_id=661292. Regexp::Assemble does a great job.
      Regexp::Assemble would not track the original pattern correctly. I give up, it's too hard for me :(
      use Regexp::Assemble;
      my $patterns = "/path/to/file.txt";
      my $list_regex = Regexp::Assemble->new(file => $patterns);
      $list_regex->track( 1 );
      open( FILE, "<", "$arg1") or die "$arg1: $!\n";
      while (<FILE>) {
          if (/$list_regex/) {print "\n$arg1\n$list_regex->matched\n";}
      }
      close(FILE);
      }
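One thing that may explain the tracking output above: Perl does not interpolate method calls inside double-quoted strings, so "$list_regex->matched" prints the stringified object followed by the literal text "->matched". The sketch below shows this with a hypothetical stand-in class (Demo::Tracker is invented for illustration and is not part of Regexp::Assemble). The Regexp::Assemble documentation also describes a match method intended for use with tracking, so it is worth checking for your installed version.

```perl
use strict;
use warnings;

# Hypothetical stand-in for any object that has a matched() method.
package Demo::Tracker;
sub new     { my ($class, $pattern) = @_; return bless { matched => $pattern }, $class }
sub matched { return $_[0]{matched} }

package main;

my $tracker = Demo::Tracker->new('part1.*part2');

# Only the variable is interpolated; "->matched" stays literal text.
my $wrong = "$tracker->matched";

# Call the method outside the string ...
my $right = "pattern: " . $tracker->matched;

# ... or interpolate the call through an anonymous array dereference.
my $also_right = "pattern: @{[ $tracker->matched ]}";

print "$wrong\n$right\n$also_right\n";
```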
      Now I have this code, but I'm facing a new problem. In my first example I use both flags, /is; I need /s so that . matches newlines as well. I have this:
      my $patterns = "/path/to/file.txt";
      my $arg1 = shift;
      open( PATTERNS, "<", $patterns ) or die "$patterns: $!\n";
      my @list_patterns = <PATTERNS>;
      close PATTERNS;
      chomp @list_patterns;
      my $regexStr = "(" . join("|", @list_patterns) . ")";
      my $list_regex = qr{$regexStr}i;
      open( FILE, "<", "$arg1") or die "$arg1: $!\n";
      while (<FILE>) {
          if (/$list_regex/) {print "\n$arg1\n$1\n";}
      }
      close(FILE);

      Adding s in either place,
      my $list_regex = qr{$regexStr}is;
      or
      if (/$list_regex/is)
      does not solve the problem.
      part1.*part2
      This pattern works with my original script; .* should also match newlines.
      fggffgfg part1 hghggh ghhggh hggh part2 ytyty
      This is the last problem; otherwise the script works perfectly and is much faster, thanks to your advice. Next I will take a look at local $/.
        You say that you "need /s so . to match newlines as well", and I assume this means that if your pattern list includes something like foo.*?bar, you would want that to match either of the following examples (where I'll put parens around the intended match):
        example 1:
        blah (foobar) blah

        example 2:
        blah (foo
        all sorts of content
        on lots of lines of data
        bar) blah
        In that case, you need to read the data file in slurp mode:
        my $regex_string = join '|', @list_patterns;
        open( FILE, "<", $arg1 ) or die "$arg1: $!";
        $_ = do { local $/; <FILE> };  # temporarily set $/ = undef to slurp file
        close FILE;
        if ( /($regex_string)/is ) {
            # got a match...
        }
        Regarding the placement of parens and regex qualifiers, you could do that in other ways, like:
        my $list_regex = qr/($regex_string)/is;
        ...
        if ( /$list_regex/ ) {
            ...
        No difference, really, but when there's a choice, I like having the parens visible in the code near the place where I use $1, $2, etc.
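To see why adding /s makes no difference while the file is still read one line at a time, here is a minimal sketch. It assumes the sample text from the question was spread over three lines (the original line breaks are not shown, so that layout is a guess), and it reads from an in-memory filehandle instead of a real file. The helper names match_by_line and match_slurped are made up for this illustration.

```perl
use strict;
use warnings;

# Assumed layout of the sample data: part1 and part2 sit on different lines.
my $data  = "fggffgfg part1 hghggh\nghhggh hggh\npart2 ytyty\n";
my $regex = qr/(part1.*part2)/is;

# Reads line by line: each string holds a single line, so /s has no
# newline to cross and the pattern never sees part1 and part2 together.
sub match_by_line {
    my ($text, $re) = @_;
    open my $fh, '<', \$text or die "in-memory open failed: $!";
    while ( my $line = <$fh> ) {
        return $1 if $line =~ $re;
    }
    return;
}

# Slurps everything into one string, so .* (with /s) can span lines.
sub match_slurped {
    my ($text, $re) = @_;
    open my $fh, '<', \$text or die "in-memory open failed: $!";
    my $content = do { local $/; <$fh> };
    return $content =~ $re ? $1 : undef;
}

print "line by line: ", ( defined match_by_line( $data, $regex ) ? "match" : "no match" ), "\n";
print "slurped:      ", ( defined match_slurped( $data, $regex ) ? "match" : "no match" ), "\n";
```

The regex and its flags are identical in both helpers; only the size of the string handed to the match differs.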
Re^2: Comparing pattern
by LanX (Saint) on Sep 19, 2009 at 11:31 UTC
    hmmm ... I think your code only reports the first pattern matching while the OP's example should print every pattern matching...

    ...but I don't know what's really intended...

    Cheers Rolf

Re^2: Comparing pattern
by mrc (Sexton) on Sep 20, 2009 at 18:27 UTC
    graff, based on your suggestion my script is working pretty well now.
    Thank you once again!

    However, I'm not able to track down the original pattern, because $1 is no longer working the way you suggested in your first example.
    print "\n$arg1\n$1\n";
    I have 1000+ patterns and I will add more. I need this scanner to fight against those bloody spammers.
    See this example file, but with thousands of links and similar lines.
    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
    <loc>http://test.com/index.html</loc>
    <lastmod>2009-08-21</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.7</priority>
    </url>
    <url>
    <loc>http://test.com/page_1.html</loc>
    <lastmod>2009-08-06</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
    </url>
    <url>
    <loc>http://test.com/page_2.html</loc>
    <lastmod>2009-08-10</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.4</priority>
    </url>

    I don't know why patterns like these (each on its own line) fail against this file. The script enters an endless loop, using an entire CPU core.
    <a href=.*<a href=.*poker
    page.{0,10}.html.*page.{0,10}.html.*page.{0,10}.html.*map

    When I change them to
    <a href=.*<a href
    page.{0,10}.html.*page.{0,10}.html.*page.{0,10}.html

    they work without a problem.
      You say that you want to capture and print the string that matches, but you aren't getting the output you expect? Your examples don't make any sense to me. (How does that xml snippet relate to anything?)

      If you're still having a problem with capturing and printing matches the way you want, you should post a minimal, self-contained example consisting of:

      • runnable code
      • a file containing some regex patterns
      • a data sample containing the kind of pattern you want to capture.
      Maybe in the process of putting that together, you'll realize where the problem really is, and solve it on your own. Good luck.
        Ok, to be more clear.

        I have this code:
        #!/usr/bin/perl -w
        use strict;
        my $patterns = "/path/to/patterns.txt";
        my $arg1 = shift;
        open (PAT, '<', $patterns) or die "$patterns: $!\n";
        my @patterns = <PAT>;
        close(PAT);
        chomp @patterns;
        my $regex_string = join '|', @patterns;
        open( FILE, "<", "$arg1") or die "$arg1: $!\n";
        $_ = do { local $/; <FILE> };
        close(FILE);
        if ( /($regex_string)/is ) {print "\n$arg1\n$1\n";}
        Test list with patterns:
        /path/to/patterns.txt
        part1.*part2
        Foo bar
        Other pattern
        Test file to scan:
        hghghgghghh
        part1 fff part2
        jhhjhjkjk
        Foo bar
        kkjkjkj
        Other pattern
        $1 shows the text that matched, including everything between part1 and part2, rather than just the pattern part1.*part2 that produced the match, which is what I want to see.
        /path/to/file
        part1 fff part2

        Also, only the first pattern found is displayed now. That's not a problem, but I'd also like to know how to display all matching patterns if a file contains more than one.
        Please bear with an inexperienced user like me. Thank you!
        Regarding the other problem with xml file to scan, I must do more tests to know exactly where the problem is.
        Thank you very much for your reply. I can't check the code these days, but I will let you know how it's going asap.
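Both wishes in this post (seeing the pattern itself rather than the matched text, and seeing every pattern that matches) can be met by testing the patterns one at a time. The sketch below is only one possible approach, not something proposed in the thread; the helper name matching_patterns and the extra 'not present' pattern are invented for the example, and the data layout is assumed.

```perl
use strict;
use warnings;

# Returns every pattern (as written in the pattern file) that matches
# somewhere in $text. Each pattern is compiled and tested on its own,
# so the original pattern string is always known.
sub matching_patterns {
    my ($text, @patterns) = @_;
    my @hits;
    for my $pattern (@patterns) {
        my $re = qr/$pattern/is;
        push @hits, $pattern if $text =~ $re;
    }
    return @hits;
}

my @patterns = ( 'part1.*part2', 'Foo bar', 'Other pattern', 'not present' );
my $text     = "hghghgghghh\npart1 fff part2\njhhjhjkjk\nFoo bar\nkkjkjkj\nOther pattern\n";

print "$_\n" for matching_patterns( $text, @patterns );
```

With 1000+ patterns this costs one regex pass per pattern, so it is slower than a single joined alternation. A reasonable compromise is to keep the joined regex as a quick first test and run this loop only on files where it matches.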
      Graff, you're an awesome guy!! The final version of the script works exactly the way I want, except for one small problem.
      Let me elaborate.

      I have this file to scan: example.txt (used by scammers/abusers)
      http://uploading.com/files/get/cm3364a5/
      This pattern
      page.{0,10}html.*?page.{0,10}html.*?</changefreq
      works because example.txt contains all the elements of the pattern.

      If I change pattern to
      page.{0,10}html.*?page.{0,10}html.*?kkk
      (example.txt doesn't contain kkk), the script enters a loop and CPU usage becomes very high.
      I think it's related to the excessive number of repetitions of the same pattern in the file, because reducing example.txt to only a few lines solves the problem.
      Any idea how to solve this bug?
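A likely explanation is regex backtracking rather than a true infinite loop: when the last piece (kkk) is missing, the engine may retry every combination of earlier "page...html" positions before giving up, and a file with thousands of such lines makes that very slow. One possible workaround, sketched below, is to split the pattern at each .* or .*? and search for the pieces one after another, each search resuming where the previous one stopped. The helper matches_in_order is hypothetical, and it is not a drop-in equivalent of the full regex: it only answers yes or no, it assumes the pattern contains no escaped \.\* sequences, and in rare cases it can miss a match that the full regex would find.

```perl
use strict;
use warnings;

# Splits a pattern of the form A.*?B.*?C into its pieces and looks for
# them in order. Each scalar-context /g match resumes at pos($text), so
# the text is scanned roughly once instead of once per combination.
sub matches_in_order {
    my ($text, $pattern) = @_;
    # Drop empty pieces: an empty regex would reuse the last successful pattern.
    my @parts = grep { length } split /\.\*\??/, $pattern;
    for my $part (@parts) {
        return 0 unless $text =~ /$part/gis;
    }
    return 1;
}

# Small stand-in for the sitemap-style file described in the thread.
my $doc = join "\n", map {
    "<loc>http://test.com/page_$_.html</loc>\n<changefreq>monthly</changefreq>"
} 1 .. 3;

for my $pattern (
    'page.{0,10}html.*?page.{0,10}html.*?</changefreq',
    'page.{0,10}html.*?page.{0,10}html.*?kkk',
) {
    printf "%-50s %s\n", $pattern, matches_in_order( $doc, $pattern ) ? "match" : "no match";
}
```

Because a failed piece ends the search immediately, a missing final element such as kkk costs one scan of the remaining text rather than a retry from every earlier starting point.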