jhanna has asked for the wisdom of the Perl Monks concerning the following question:

My spam filter uses a regular expression with about 100 terms |ed together. Perl get stuck when some strings are evaluated against the regular expression. I broke the re into pieces and it began working again.

These are generally simple strings. I could do something like:

foreach (@string) { return 1 if index($s,$_); } return 0;
But That must be slower than
return $s=~/$match1/io;
especially because this is done hundreds of times per email.

So what I'm looking for is something like flex to generate a fast RE scanner for BIG re's, without some of the limitations (or power) of perl's builtin re's.

I've poked around at cpan, but noting jumped out at me.

j

Replies are listed 'Best First'.
Re: regex is too long
by japhy (Canon) on May 08, 2001 at 20:37 UTC
    You might want to try forming the words into an optimized regex. Also, you might want to make sure the strings are quotemeta()ed, in case they contain regex characters.
    my $regex = join '|', map quotemeta($_), @strings; $regex = qr/$regex/;
    As far as optimizing goes, since Perl uses NFA (not DFA), you'll get an efficiency boost:
    /out|this|once|those or|about|bout/ # will work better as /th(is|ose or)|((a?)b?)out|once/
    Update: you also want to put the smaller stop-words towards the beginning of the regex, if at all possible.

    japhy -- Perl and Regex Hacker
(tye)Re: regex is too long
by tye (Sage) on May 08, 2001 at 22:38 UTC

    See SAS log scanner for some benchmarks and some code by tilly for optimizing such regular expressions.

            - tye (but my friends call me "Tye")
      Thanks for the excellent reference -- that discussion is quite to the point. I guess I should run a benchmark to figure which is actually faster: multiple index() or hand-optimized regex... I worried about the regex that by adding too many parenthesis it would further bog it down, but until I try I guess there's no way to know.

      j

Re: regex is too long
by dws (Chancellor) on May 08, 2001 at 20:53 UTC
    Perl get stuck when some strings are evaluated against the regular expression.

    Please elaborate on what you mean by "Perl get stuck". Slows to a crawl? Hangs? Crashes in flames?

    And can you characterize what's special about "some strings"?

    It might help to post some code. Show us the table of strings you're constructing the regex from, and give us an example string that it gets stuck on.

    Without some hard data to work from, we can only speculate and make educated guesses.

      "Perl gets stuck" means that the when the process gets to the regex line in the code it never returns and appears (according to ps) to be consuming large amounts of CPU (ie endless loop). It only takes a few dozen of these to bring the system to its knees.

      I won't post the entire table because it's long and rude. However, you'll quickly get the gist.

      $evil= ## list of re phrases t 'barely legal Unsensored pics rated adult site (find out|learn|discover) ANYTHING about anyone (remove.*\@dcemail\.com) bagboy\@burmeses\.net \(a\)\s*\(2\)\s*\(C\).*1618 this limited time free offer thousands of extra dollars earn a great monthly income \bfat absorber\b chain letter.*pyramid scheme pyramid scheme.*chain letter e-mail\w* work\w*\! Earn BIG \$\$\$ block this remove account quit watching others get rich bulk email works! firmer erections vaginal lubrication s e x drive Enhances Orgasms eraseus@yahoo.com over \d+ million fresh email content-\w+: .* .*name\s*=\s*".*\.(exe|scr|pif|vbs)" And it\'s 100% LEGAL! No Hidden Fees'; ## actually is 3 x this long $evil=~s/\n/\|/g; $evil=~s/ +/ /g; # ... skipping ahead ... while($l=&getnextline) { $_="$l$lastline"; ## combine this line and the last s/\s+/ /g; ## simplify white space matching $isSpam = $isSpam || /$evil/io; # ... etc ... }
      What is special about some strings, I don't know. I spent some time trying to debug it, and when I found that I could just break the regular expression into parts (ie $evil1, $evil2, $evil3) and it started working again I didn't spend much more time on it. There was no clear pattern to me why it was going into the endless loop. I'd be happy to email you the complete code and test data which causes it to break. I'm running Perl 5.005_03 (freebsd) and wondered if upgrading to the new release of perl would fix it (I read it resolved some re bugs).

      Does that help?

        I suspect that you have some run-away patterns in there. Try reading Death to Dot Star!. After you read that you should have the background to see why patterns like:
        content-\w+: .* .*name\s*=\s*".*\.(exe|scr|pif|vbs)"
        are a lot of work to calculate. (Hit one content-type: and then you force a ton of scanning and backtracking through the whole string.)

        In fact I am going to guess that either that or another RE is really efficient. Perl's RE engine's optimizations are able to spot and fix it when they see the offending pattern in a small RE, but with a large one they give up analyzing before fixing the disaster. (It might, in fact, be that line.)

        And yes, it is possible that a new release of Perl will fix it. Ilya added more sanity checks to catch even more of these and may catch a disaster he missed before. But that will be less reliable than looking through your file for .*'s, and particularly .*'s that will force a lot of backtracking to happen. (At least make sure that between any 2 .*'s there is some meaningful text.)

Re: regex is too long
by Albannach (Monsignor) on May 08, 2001 at 22:17 UTC
    I don't think that index must be slower than a regex. It seems to me that it should be much faster as it should be a much simpler function, and regexen don't do bulk alternation very quickly either. Of course you may be able to represent 100 simple text strings in fewer than 50 sub-expressions in a hand-optimized regex, so it all depends on how much work you want to have to put into this, and how often you're going to change the list. Perhaps a benchmark is in order here. Now I expect the relative speeds will vary quite a bit as the size of the string to search, the number of words in the match list, and the frequency of matches changes, but this may tell us something nonetheless (or not ;-).

    It looks like index is the winner in this simple test, but I trust that wiser monks will let me know if and how I'm producing bad results!

    First results when there is no match found:

    Benchmark: running BigRE, Index, OptRE, m// loop, each for at least 10 + CPU seconds... BigRE: 11 wallclock secs (10.56 usr + 0.00 sys = 10.56 CPU) @ 11 +690.90/s (n=123491) Index: 11 wallclock secs (10.61 usr + 0.00 sys = 10.61 CPU) @ 43 +921.29/s (n=465961) OptRE: 11 wallclock secs (10.59 usr + 0.00 sys = 10.59 CPU) @ 17 +239.59/s (n=182619) m// loop: 11 wallclock secs (10.50 usr + 0.00 sys = 10.50 CPU) @ 68 +25.71/s (n=71670)
    Then results when there is always a match found:
    Benchmark: running BigRE, Index, OptRE, m// loop, each for at least 10 + CPU seconds... BigRE: 11 wallclock secs (10.39 usr + 0.00 sys = 10.39 CPU) @ 12 +548.26/s (n=130389) Index: 10 wallclock secs (10.53 usr + 0.00 sys = 10.53 CPU) @ 56 +956.89/s (n=599813) OptRE: 11 wallclock secs (10.53 usr + 0.00 sys = 10.53 CPU) @ 17 +720.63/s (n=186616) m// loop: 11 wallclock secs (10.48 usr + 0.00 sys = 10.48 CPU) @ 78 +73.90/s (n=82550)

    Update: Damn, I knew I was missing something (probably still am...) - thanks dws! I don't keep spam around so here's the results for a chunk of 12k of text-only e-mail (as spam tends to be long-winded). I skipped the hand-optimized regex as I used nonsense words in the next case to ensure no match would be found and I didn't feel like thinking about it too hard. In both the following cases there are still just 7 words in the match list.

    So, for no match:

    Benchmark: running BigRE, Index, OptRE, m// loop, each for at least 10 + CPU seconds... BigRE: 10 wallclock secs (10.59 usr + 0.00 sys = 10.59 CPU) @ 77 +.59/s (n=822) Index: 11 wallclock secs (10.50 usr + 0.00 sys = 10.50 CPU) @ 14 +27.58/s (n=14991) m// loop: 11 wallclock secs (10.47 usr + 0.00 sys = 10.47 CPU) @ 68 +6.50/s (n=7187)
    With matches:
    Benchmark: running BigRE, Index, OptRE, m// loop, each for at least 10 + CPU seconds... BigRE: 10 wallclock secs (10.63 usr + 0.00 sys = 10.63 CPU) @ 20 +25.69/s (n=21523) Index: 11 wallclock secs (10.47 usr + 0.00 sys = 10.47 CPU) @ 19 +019.11/s (n=199092) OptRE: 11 wallclock secs (10.51 usr + 0.00 sys = 10.51 CPU) @ 27 +02.33/s (n=28415) m// loop: 11 wallclock secs (10.45 usr + 0.00 sys = 10.45 CPU) @ 95 +93.84/s (n=100294)

    And the test code:

      How do the methods compare if the target string is longer? Say, about the size of an average piece of SPAM.

        I've thought about this... To make this work I'll have to chunk the file because I don't want to keep the whole email in memory -- I'd rather stay lean. I could chunk the thing into 100 line pieces, but this makes the code messier.
Re: regex is too long
by suaveant (Parson) on May 08, 2001 at 21:31 UTC
    You may also want to look into the little used perl function study, which is good to run on a string that you want to many matches against, but not actually change the string.
                    - Ant
    study
    by jhanna (Scribe) on May 09, 2001 at 00:51 UTC
      Study works on the string being searched, not the regular expression. The string being searched is relatively short (ie < 180 chars), and is different every time. (Ok 50% different each time).
        Ahh, I thought you were running like, 40 regexps on a string, in which case it could help...
                        - Ant