in reply to Re: regex is too long
in thread regex is too long

"Perl gets stuck" means that when the process reaches the regex line in the code it never returns and appears (according to ps) to be consuming large amounts of CPU (i.e. an endless loop). It only takes a few dozen of these to bring the system to its knees.

I won't post the entire table because it's long and rude. However, you'll quickly get the gist.

$evil= ## list of re phrases t
'barely legal
Unsensored pics
rated adult site
(find out|learn|discover) ANYTHING about anyone
(remove.*\@dcemail\.com)
bagboy\@burmeses\.net
\(a\)\s*\(2\)\s*\(C\).*1618
this limited time free offer
thousands of extra dollars
earn a great monthly income
\bfat absorber\b
chain letter.*pyramid scheme
pyramid scheme.*chain letter
e-mail\w* work\w*\!
Earn BIG \$\$\$
block this
remove account
quit watching others get rich
bulk email works!
firmer erections
vaginal lubrication
s e x drive
Enhances Orgasms
eraseus@yahoo.com
over \d+ million fresh email
content-\w+: .* .*name\s*=\s*".*\.(exe|scr|pif|vbs)"
And it\'s 100% LEGAL!
No Hidden Fees'; ## actually is 3 x this long
$evil=~s/\n/\|/g;
$evil=~s/ +/ /g;
# ... skipping ahead ...
while($l=&getnextline) {
    $_="$l$lastline"; ## combine this line and the last
    s/\s+/ /g; ## simplify white space matching
    $isSpam = $isSpam || /$evil/io;
    # ... etc ...
}

I don't know what is special about some strings. I spent some time trying to debug it, but once I found that breaking the regular expression into parts (i.e. $evil1, $evil2, $evil3) made it work again, I didn't spend much more time on it. I could see no clear pattern in why it was going into the endless loop. I'd be happy to email you the complete code and the test data that causes it to break. I'm running Perl 5.005_03 (FreeBSD) and wondered whether upgrading to the new release of Perl would fix it (I read that it resolved some regex bugs).
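
For anyone wanting to try the same workaround, here is a minimal sketch of splitting the list, taken one step further so that each phrase is compiled as its own pattern. The phrase list is shortened, and the names @evil and first_spam_match are made up for this example; they are not from my real code. A side benefit is that it reports which phrase matched, which may help narrow down the one that hangs.

```perl
use strict;
use warnings;

# Hypothetical short phrase list, one regex fragment per line.
my $phrases = <<'END';
chain letter.*pyramid scheme
pyramid scheme.*chain letter
\bfat absorber\b
over \d+ million fresh email
No Hidden Fees
END

# Compile each non-empty line as its own case-insensitive pattern.
my @evil = map { qr/$_/i } grep { /\S/ } split /\n/, $phrases;

# Return the first phrase pattern that matches, or undef if none does.
sub first_spam_match {
    my ($text) = @_;
    $text =~ s/\s+/ /g;    # simplify whitespace, as in the original loop
    for my $re (@evil) {
        return $re if $text =~ $re;
    }
    return;
}

my $hit = first_spam_match("Get our  FAT absorber today");
print defined $hit ? "spam: $hit\n" : "clean\n";
```

Whether one pattern per phrase is fast enough for a list three times this long is something I have not measured.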

Does that help?

Re (tilly) 3: regex is too long
by tilly (Archbishop) on May 09, 2001 at 05:17 UTC
    I suspect that you have some run-away patterns in there. Try reading Death to Dot Star!. After you read that you should have the background to see why patterns like:
    content-\w+: .* .*name\s*=\s*".*\.(exe|scr|pif|vbs)"
    are a lot of work to calculate. (Hit one content-type: and then you force a ton of scanning and backtracking through the whole string.)
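
    As a rough illustration, one way to tighten that line is sketched below. The rewrite is an assumption on my part, not a drop-in replacement: it assumes the filename never contains a double quote, and it drops the requirement for a space before "name", so it is not strictly equivalent to the original. The sample header lines are invented.

```perl
use strict;
use warnings;

# Pattern from the post, unchanged.
my $loose = qr/content-\w+: .* .*name\s*=\s*".*\.(exe|scr|pif|vbs)"/i;

# Assumed rewrite: the filename may not contain a double quote, so the
# engine cannot scan past the closing quote while looking for the extension.
my $tight = qr/content-\w+: .*?name\s*=\s*"[^"]*\.(?:exe|scr|pif|vbs)"/i;

# Invented sample lines, already whitespace-normalized.
my @samples = (
    'Content-Type: application/octet-stream; name="setup.exe"',
    'Content-Disposition: attachment; filename="notes.txt"',
    'Content-Type: text/plain; charset="us-ascii"',
);

# Print the verdict of each pattern next to the line it was tried on.
for my $line (@samples) {
    printf "%-5s %-5s %s\n",
        ($line =~ $loose ? 'hit' : 'miss'),
        ($line =~ $tight ? 'hit' : 'miss'),
        $line;
}
```

    The negated character class is what bounds the work: once the opening quote is found, the engine can only look as far as the next quote instead of rescanning the rest of the string.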

    In fact I am going to guess that either that or another RE is really inefficient. Perl's RE engine's optimizations are able to spot and fix it when they see the offending pattern in a small RE, but with a large one they give up analyzing before fixing the disaster. (It might, in fact, be that line.)

    And yes, it is possible that a new release of Perl will fix it. Ilya added more sanity checks to catch even more of these and may catch a disaster he missed before. But that will be less reliable than looking through your file for .*'s, and particularly .*'s that will force a lot of backtracking to happen. (At least make sure that between any 2 .*'s there is some meaningful text.)
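
    If the list is too long to check by eye, a small script can do the first pass. This is only a sketch: risky_phrases is a name invented here, and the check is a crude text match on the phrase source that counts unescaped .* and .+ and flags pairs separated by nothing but whitespace. It does not understand regex syntax beyond that.

```perl
use strict;
use warnings;

# For each phrase that contains an unescaped .* or .+, record how many it
# has and whether two of them are separated only by whitespace.
sub risky_phrases {
    my @phrases = @_;
    my @report;
    for my $p (@phrases) {
        my $count = () = $p =~ /(?<!\\)\.[*+]/g;
        next unless $count;
        my $adjacent = $p =~ /(?<!\\)\.[*+]\??\s*\.[*+]/ ? 1 : 0;
        push @report,
            { phrase => $p, wildcards => $count, adjacent => $adjacent };
    }
    return @report;
}

# A few phrases taken from the posted list.
my @report = risky_phrases(
    'No Hidden Fees',
    'chain letter.*pyramid scheme',
    'content-\w+: .* .*name\s*=\s*".*\.(exe|scr|pif|vbs)"',
);

for my $r (@report) {
    printf "%d wildcard(s)%s: %s\n",
        $r->{wildcards},
        ($r->{adjacent} ? ', adjacent pair' : ''),
        $r->{phrase};
}
```

    Anything it reports with an adjacent pair is a good first candidate for rewriting.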