
graff, based on your suggestion, my script is working pretty well now.
Thank you once again!

However, I'm not able to track down the original pattern, because $1 no longer works the way you suggested in your first example.
print "\n$arg1\n$1\n";
I have 1000+ patterns and I will add more. I need this scanner to fight against those bloody spammers.
See this example file; the real one has thousands of links and similar lines.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://test.com/index.html</loc>
    <lastmod>2009-08-21</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>http://test.com/page_1.html</loc>
    <lastmod>2009-08-06</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://test.com/page_2.html</loc>
    <lastmod>2009-08-10</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.4</priority>
  </url>

I don't know why patterns like these (each on its own line) fail against this file. The script enters an endless loop, using an entire CPU core.
<a href=.*<a href=.*poker
page.{0,10}.html.*page.{0,10}.html.*page.{0,10}.html.*map

When I change them to
<a href=.*<a href
page.{0,10}.html.*page.{0,10}.html.*page.{0,10}.html

they work without a problem.

Re^3: Comparing pattern
by graff (Chancellor) on Sep 20, 2009 at 19:16 UTC
    You say that you want to capture and print the string that matches, but you aren't getting the output you expect? Your examples don't make any sense to me. (How does that xml snippet relate to anything?)

    If you're still having a problem with capturing and printing matches the way you want, you should post a minimal, self-contained example consisting of:

    • runnable code
    • a file containing some regex patterns
    • a data sample containing the kind of pattern you want to capture.
    Maybe in the process of putting that together, you'll realize where the problem really is, and solve it on your own. Good luck.
      Ok, to be more clear.

      I have this code:
      #!/usr/bin/perl -w
      use strict;

      my $patterns = "/path/to/patterns.txt";
      my $arg1 = shift;

      # Read the pattern list, one regex per line
      open (PAT, '<', $patterns) or die "$patterns: $!\n";
      my @patterns = <PAT>;
      close(PAT);
      chomp @patterns;
      my $regex_string = join '|', @patterns;

      # Slurp the whole file to scan into $_
      open( FILE, "<", "$arg1") or die "$arg1: $!\n";
      $_ = do { local $/; <FILE> };
      close(FILE);

      if ( /($regex_string)/is ) {
          print "\n$arg1\n$1\n";
      }
      Test list with patterns:
      /path/to/patterns.txt
      part1.*part2
      Foo bar
      Other pattern
      Test file to scan:
      hghghgghghh part1 fff part2 jhhjhjkjk Foo bar kkjkjkj Other pattern
      $1 will show all wildcarded text between part1 and part2 and not only the pattern part1.*part2 as it should.
      /path/to/file
      part1 fff part2

      Also, only first pattern found is displayed now. That's not a problem, but I'd also like to know how to display all patterns if a file contains more than one.
      Please bear with an inexperienced user like me. Thank you!
      Regarding the other problem with xml file to scan, I must do more tests to know exactly where the problem is.
        only first pattern found is displayed now. That's not a problem, but I'd also like to know how to display all patterns if a file contains more than one.

        That's easy -- instead of using an "if" statement like this:

        if (/($regex_string)/is) {
        just use a while loop like this -- making sure to add the "g" modifier (and while I'm at it, I'll add some clarification to the output):
        while (/($regex_string)/isg) {
            print "\nmatched in $arg1:\n==$1==\n";
        }
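        To see the loop in isolation, here is a minimal, self-contained sketch. It assumes the three test patterns and the test line posted above, held in hypothetical in-memory variables (@patterns, $text) instead of being read from files:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed stand-ins for patterns.txt and the file being scanned.
my @patterns = ( 'part1.*part2', 'Foo bar', 'Other pattern' );
my $text = "hghghgghghh part1 fff part2 jhhjhjkjk Foo bar kkjkjkj Other pattern";

my $regex_string = join '|', @patterns;

# With /g in scalar context, each pass through the condition resumes
# where the previous match ended, so every match is visited once.
my @found;
while ( $text =~ /($regex_string)/isg ) {
    push @found, $1;
    print "==$1==\n";
}
```

        Each iteration sets $1 to the literal text that matched, so this should print one line per match rather than stopping at the first.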
        As for your other issue:

        $1 will show all wildcarded text between part1 and part2 and not only the pattern part1.*part2 as it should.

        What makes you think it "should" display the string "part1.*part2"? When using the capture variables ($1, $2, ...), the normal situation is to want the actual (complete, literal) string that matched the regex, rather than the regex string with its wildcards.

        If you want the wildcard-enabled regexes in your list to return a specific constant string, you'll probably want to include that string in your regex list file, store those replacement strings with their regexes in a separate hash, and add some logic in the while loop shown above that will replace any given matching string with the appropriate constant replacement string. Here's an adapted version of the three files involved:
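        A minimal sketch of that hash-based idea follows. It is not the original set of files: the tab-separated "regex, label" list format, the label names, and the in-memory @pattern_lines and $text are all assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed list format: "regex<TAB>constant label", one entry per line.
my @pattern_lines = (
    "part1.*part2\tPART1-PART2",
    "Foo bar\tFOO-BAR",
    "Other pattern\tOTHER",
);
my $text = "hghghgghghh part1 fff part2 jhhjhjkjk Foo bar kkjkjkj Other pattern";

# Keep the regexes in list order, and map each one to its label.
my ( @patterns, %label_for );
for my $line (@pattern_lines) {
    my ( $re, $label ) = split /\t/, $line, 2;
    push @patterns, $re;
    $label_for{$re} = $label;
}
my $regex_string = join '|', @patterns;

my @reported;
while ( $text =~ /($regex_string)/isg ) {
    my $matched = $1;    # save it: the inner match below resets $1

    # Find the first listed regex that accounts for the whole match.
    my ($re) = grep { $matched =~ /\A(?:$_)\z/is } @patterns;
    my $label = defined $re ? $label_for{$re} : $matched;

    push @reported, $label;
    print "==$label==\n";
}
```

        Using the regex text itself as the label would print "part1.*part2" instead of "part1 fff part2". Re-testing every regex for each match is simple but slow with 1000+ patterns, so treat this as a starting point only.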

      Thank you very much for your reply. I can't check the code these days, but I will let you know how it's going asap.
Re^3: Comparing pattern
by mrc (Sexton) on Sep 26, 2009 at 10:33 UTC
    Graff, you're an awesome guy!! The final version of the script works exactly the way I want, except for one small problem.
    Let me elaborate.

    I have this file to scan: example.txt (used by scammers/abusers)
    http://uploading.com/files/get/cm3364a5/
    This pattern
    page.{0,10}html.*?page.{0,10}html.*?</changefreq
    works because example.txt contains all the elements of the pattern.

    If I change the pattern to
    page.{0,10}html.*?page.{0,10}html.*?kkk
    (example.txt doesn't contain kkk), the script enters a loop and CPU usage becomes very high.
    I think it's related to the excessive number of repetitions of the same pattern in the file, because reducing example.txt to only a few lines solves the problem.
    Any idea how to solve this bug?