in reply to Re^2: Looking for ideas on how to optimize this specialized grep
in thread Looking for ideas on how to optimize this specialized grep
>> I didn't understand s and m switch of regex until furry_marmot's explanation...
Thanks. Actually, they confused me for a long time when I was first learning Perl. I finally got it when I read Jeffrey Friedl's Mastering Regular Expressions; but I've always found a good example goes a loooong way.Some things to remember:
>> Do you have any example case like 'little prince' example for /ms?
Sure. Here's an email header I pulled out of my spam catcher, with a bunch of regexes to illustrate.$text = <<'EOT'; Message-ID: <ODM2bWFpbGVyLmRpZWJlYS40MjYyNjE2LjEyOTU1NDE2MTg=@out-p-h. +customernews.net> From: "GenericOnline Pharmacy" <marmot@furrytorium.com> To: "Angie Morestead" <marmot@furrytorium.com> Subject: Buy drugs online now! Date: Thu, 20 Jan 2011 18:40:18 +0200 Content-Type: multipart/related; boundary="----=_Weigard_drugs_CG_0" EOT $text =~ /^Subject:.+drugs/m; # Anchor just after \n, before Subject. # Matches 'Subject: Buy drugs' $text =~ /\nSubject:.+drugs/; # Equivalent $text =~ /^Subject:.+drugs/ms; # '.' matches newlines, all the way to # '..._Weigard_drugs', which is not wh +at we wanted. $text =~ /^Subject:.+?drugs/ms; # '.' matches newlines, but searches f +rom current string # position, stopping when it matches ' +Subject: Buy drugs'. # This is a little slower than the fir +st two, but # equivalent. /s is countered by the . ++?, but if 'drugs' # was not in the Subject line, the reg +ex would keep keep # on going. # Here are some fun ones. # The email address should be "Furry Marmot" <marmot@furrytorium.com>, + or just # marmot@furrytorium.com. Anything else is spam. print "Spam!!!\n" if $text =~ /^(?:From|To):\s*"(?!.+Furry Marmot)[^"]*" <marmot\@fu +rrytorium\.com>/m; # Regarding the [^"]*, if the regex finds Furry Marmot in quotes, it f +ails and this isn't # spam. But if it finds something else, we still have to match somethi +ng between the # quotes, and then match the email to determine if it is spam. # I should never see anything from me, to me. print "Spam!!!\n" if $text =~ /(?=^From:[^\n]+marmot\@furrytorium\.com).+^To:[^\n]+marm +ot\@furrytorium\.com/ms; # This starts at the beginning of header block, finds From: line with +my email address, # resets to start of block (because of zero-width lookahead assertion) +, then finds To: # line with my email address. It is the equivalent of... if ($text =~ /^From:.+marmot\@furrytorium\.com)/m && /^To:.+marmot\@fu +rrytorium\.com/m) { print "Spam!!!\n" } # ...but I can include the single pattern in a list of patterns that I + might want to match # against the string.
>> People sometimes say regex is slow
It depends on how it's used. The regex engine is actually pretty quick, but there are certain things that can really slow it down. It's been a while since I read Friedl's book, but basically the search engine looks for the start of a pattern, and then tries to find the rest. If the rest is not there, it backs out of what it was able to match and goes looking again.So just searching for /^From:.+marmot/m, it will first look for the beginning of the text, and then look at each character for a newline. Once it has that, it looks to see if the next character is an 'F'. If not, it backtracks and searches for the next newline. Once it finds 'From:', it looks again for a newline (because we're not using /s), and works back to see if it can find 'marmot'. If not, it backs out of the 'From:' it has matched so far and goes looking for another 'From:' line.
More complex searches can cause it to backtrack up a storm. But a well-constructed regex can minimize that. Index is probably faster at searching for plaintext, but it can't search for patterns, which limits its usefulness.
--marmot
|
---|
Replies are listed 'Best First'. | |
---|---|
Re^4: Looking for ideas on how to optimize this specialized grep
by remiah (Hermit) on Jan 26, 2011 at 08:00 UTC | |
by furry_marmot (Pilgrim) on Jan 26, 2011 at 18:30 UTC | |
by remiah (Hermit) on Jan 27, 2011 at 06:04 UTC | |
by furry_marmot (Pilgrim) on Jan 28, 2011 at 00:55 UTC | |
by remiah (Hermit) on Jan 28, 2011 at 07:51 UTC |