Re^2: Looking for ideas on how to optimize this specialized grep

I didn't understand the /s and /m regex modifiers until furry_marmot's explanation. About /ms, man perlre says:

'let the "." match any character whatsoever, while still allowing "^" and "$" to match, respectively, just after and just before newlines within the string'

I can't think of an example that needs this. Do you have an example for /ms, like the 'little princess' example?

As for the block (paragraph) mode in this example, I have seen this approach in awk scripts, but this is the first time I have met it ($/ = '') in Perl.

People sometimes say regexes are slow, so I tried the index function instead of a regex, but it does not seem to improve the time. I simplified this example to pick up only the From address; the index version needs UTF-8 handling for index and substr.

use strict;
use warnings;

use File::Find;
use Data::Dumper;

my %addresses;

sub test1 {
    my ($from);
    find(sub {
        return unless -f $_;

        open my $fh, '<', $_ or die;
            local $/ = '';      # "Paragraph" mode, reads a block of text to next \n\n
            $_ = <$fh>;         # Read Header block

            ($from)= $_ =~ /^From:(.*)/m; # /m to anchor
            #print "$from\n";

        close $fh;
    }, glob('./009_mailtest/*'));

    #print Dumper \%addresses;

}
sub test2{
    binmode(STDOUT,":utf8");
    my ($from,$bgn,$end,$len);

    find(sub {
        return unless -f $_;

        open my $fh, '<:utf8', $_ or die;
            local $/ = '';      # "Paragraph" mode, reads a block of text to next \n\n
            $_ = <$fh>;         # Read Header block

            $bgn=index($_,"From:",0) + length("From:");
            $end=index($_,chr(10),$bgn+1);
            $len=$end - $bgn;
            $from=substr($_, $bgn, $len);
            #print "$from\n";
        close $fh;
    }, glob('./009_mailtest/*'));
}

my($start,$end);
$start=(times)[0];
&test1;
$end=(times)[0];
print "with regex=" . ($end - $start) . "sec\n";

$start=(times)[0];
&test2;
$end=(times)[0];
print "without regex=" . ($end - $start) . "sec\n";

The result for my 319Mb test mailbox was:

with regex=0.296875sec
without regex=0.34375sec
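
The two timings above include walking the directory and opening and reading every file, which may hide any difference between the regex and index itself. A minimal sketch that times only the extraction step on an in-memory header is shown below. It uses the core Benchmark module; the header text and the sub names from_regex and from_index are made up for illustration and are not part of the script above.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical header block; the addresses are placeholders.
my $header = join "\n",
    'Message-ID: <id@example.invalid>',
    'From: "Someone" <someone@example.invalid>',
    'To: "Other" <other@example.invalid>',
    'Subject: test',
    '';

# Regex version: /m lets ^ match at the start of any line.
sub from_regex {
    my ($h) = @_;
    my ($from) = $h =~ /^From:(.*)/m;
    return $from;
}

# index/substr version. Note that it finds the first "From:" anywhere
# in the block, not only at the start of a line.
sub from_index {
    my ($h) = @_;
    my $bgn = index($h, 'From:');
    return if $bgn < 0;                   # no From: at all
    $bgn += length('From:');
    my $end = index($h, "\n", $bgn);
    $end = length($h) if $end < 0;        # From: was the last line
    return substr($h, $bgn, $end - $bgn);
}

# Run each sub for at least one CPU second and print a comparison table.
cmpthese(-1, {
    regex => sub { from_regex($header) },
    index => sub { from_index($header) },
});
```

Both subs return the same string for this header (including the leading space after the colon), so the comparison is like for like. The rates will vary by machine and Perl version.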

Re^3: Looking for ideas on how to optimize this specialized grep by furry_marmot (Pilgrim) on Jan 25, 2011 at 21:16 UTC
>> I didn't understand s and m switch of regex until furry_marmot's explanation...

Thanks. Actually, they confused me for a long time when I was first learning Perl. I finally got it when I read Jeffrey Friedl's Mastering Regular Expressions; but I've always found a good example goes a loooong way.

Some things to remember:

`.+` and `.*` are greedy. They look as far forward as they can and then work backwards to find the largest match possible (see example below).

`.+?` and `.*?` are not greedy. They search forward from the current string position to find the earliest match possible. These are slower (I forget by how much), but sometimes they are what you need.

/s allows `.` to match newlines, so `.+` will look all the way to the end of whatever you're searching, whether it's a few characters, or several Kb of text, and then starts working backwards. Without /s, it only looks to the next newline to start looking back.

/m is shorthand for (though not quite identical to) anchoring on a newline, but it can be useful to think of embedded lines in a block of text instead of thinking of a bunch of text and newlines all jumbled together.

>> Do you have any example case like 'little prince' example for /ms?

Sure. Here's an email header I pulled out of my spam catcher, with a bunch of regexes to illustrate.

    $text = <<'EOT';
    Message-ID: <ODM2bWFpbGVyLmRpZWJlYS40MjYyNjE2LjEyOTU1NDE2MTg=@out-p-h.customernews.net>
    From: "GenericOnline Pharmacy" <marmot@furrytorium.com>
    To: "Angie Morestead" <marmot@furrytorium.com>
    Subject: Buy drugs online now!
    Date: Thu, 20 Jan 2011 18:40:18 +0200
    Content-Type: multipart/related; boundary="----=_Weigard_drugs_CG_0"
    EOT

    $text =~ /^Subject:.+drugs/m;    # Anchor just after \n, before Subject.
                                     # Matches 'Subject: Buy drugs'
    $text =~ /\nSubject:.+drugs/;    # Equivalent
    $text =~ /^Subject:.+drugs/ms;   # '.' matches newlines, all the way to
                                     # '..._Weigard_drugs', which is not what we wanted.
    $text =~ /^Subject:.+?drugs/ms;  # '.' matches newlines, but searches from current string
                                     # position, stopping when it matches 'Subject: Buy drugs'.
                                     # This is a little slower than the first two, but
                                     # equivalent. /s is countered by the .+?, but if 'drugs'
                                     # was not in the Subject line, the regex would keep
                                     # on going.

    # Here are some fun ones.

    # The email address should be "Furry Marmot" <marmot@furrytorium.com>, or just
    # marmot@furrytorium.com. Anything else is spam.
    print "Spam!!!\n"
        if $text =~ /^(?:From|To):\s"(?!.*Furry Marmot)[^"]*" <marmot\@furrytorium\.com>/m;
    # Regarding the [^"]*, if the regex finds Furry Marmot in quotes, it fails and this isn't
    # spam. But if it finds something else, we still have to match something between the
    # quotes, and then match the email to determine if it is spam.

    # I should never see anything from me, to me.
    print "Spam!!!\n"
        if $text =~ /(?=^From:[^\n]+marmot\@furrytorium\.com).+^To:[^\n]+marmot\@furrytorium\.com/ms;
    # This finds the From: line with my email address, resets to the start of that line
    # (because of the zero-width lookahead assertion), then scans forward to find a To:
    # line with my email address. As long as To: comes after From:, it is the equivalent of...
    if ($text =~ /^From:.+marmot\@furrytorium\.com/m && $text =~ /^To:.+marmot\@furrytorium\.com/m) {
        print "Spam!!!\n"
    }
    # ...but I can include the single pattern in a list of patterns that I might want to match
    # against the string.

>> People sometimes say regex is slow

It depends on how it's used. The regex engine is actually pretty quick, but there are certain things that can really slow it down. It's been a while since I read Friedl's book, but basically the search engine looks for the start of a pattern, and then tries to find the rest. If the rest is not there, it backs out of what it was able to match and goes looking again.

So just searching for `/^From:.+marmot/m`, it will first look for the beginning of the text, and then look at each character for a newline. Once it has that, it looks to see if the next character is an 'F'. If not, it backtracks and searches for the next newline. Once it finds 'From:', it looks again for a newline (because we're not using /s), and works back to see if it can find 'marmot'. If not, it backs out of the 'From:' it has matched so far and goes looking for another 'From:' line.

More complex searches can cause it to backtrack up a storm. But a well-constructed regex can minimize that. index is probably faster at searching for plain text, but it can't search for patterns, which limits its usefulness.

--marmot
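
To see the effect of each modifier directly, the sketch below runs the same Subject pattern with different flags and prints what each one matched. The header is a shortened, made-up stand-in for the one in the post, and the helper name matched is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical header: 'drugs' appears in the Subject line and again later.
my $text = <<'EOT';
From: "Sender" <sender@example.invalid>
Subject: Buy drugs online now!
Content-Type: multipart/related; boundary="_drugs_0"
EOT

# Return the text a pattern matched, or undef when it fails.
sub matched {
    my ($re) = @_;
    return $text =~ $re ? $& : undef;
}

my %got = (
    'no flags'   => matched(qr/^Subject:.+drugs/),     # ^ only at start of string
    '/m'         => matched(qr/^Subject:.+drugs/m),    # ^ after each newline
    '/ms greedy' => matched(qr/^Subject:.+drugs/ms),   # . crosses newlines
    '/ms lazy'   => matched(qr/^Subject:.+?drugs/ms),  # stops at first 'drugs'
);

for my $label (sort keys %got) {
    my $shown = defined $got{$label} ? $got{$label} : '(no match)';
    $shown =~ s/\n/\\n/g;    # show embedded newlines visibly
    printf "%-11s %s\n", $label, $shown;
}
```

Without any flags the pattern fails, because the string starts with From: and ^ only matches at the very beginning. With /m and with the lazy /ms version the match is 'Subject: Buy drugs'; the greedy /ms version runs on across the newline to the last 'drugs' in the boundary string.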
Re^4: Looking for ideas on how to optimize this specialized grep by remiah (Hermit) on Jan 26, 2011 at 08:00 UTC
It took me a long time to understand the `?` that suppresses backtracking, and regexes like `[^"]*` that suppress backtracking, so extended regexes will take me some time. Your example

    print "Spam!!!\n" if $text =~ /^(?:From|To):\s"(?!.*Furry Marmot)[^"]*" <marmot\@furrytorium\.com>/m;

is Greek to me for now, but someday I will understand extended regexes.

    print "Spam!!!\n" if $text =~ /(?=^From:[^\n]+marmot\@furrytorium\.com).+^To:[^\n]+marmot\@furrytorium\.com/ms;

This example of /ms and your explanation give me a clue as to what "zero width" means.
Re^5: Looking for ideas on how to optimize this specialized grep by furry_marmot (Pilgrim) on Jan 26, 2011 at 18:30 UTC
Heh. That was tricky to get to work right. But let me save you some reading. First, `[^"]*` doesn't suppress backtracking. It just matches zero or more of anything that isn't a quote. `[^...]` is a negated character class; the caret means don't match any of the characters between the square brackets.

Now let me explain zero-width lookaheads so you know what that code is about. When you do a match against $string, Perl keeps track of the offset from the start of the string, which you can get (or set, actually) with pos($string). It makes more sense when you are doing multiple matches against the same string. Let's say I want to collect all the peppers in the following:

    $s = "I'm a pepper, he's a pepper, you're a pepper, she's a pepper...";
    while( $s =~ /(pepper)/g ) {
        push @peppers, $1;
    }

This will put 4 peppers in the array @peppers. As you probably know, the /g modifier tells the match to remember the position after the last match, and the next time through the loop, start looking for another match from that point. So...

    I'm a pepper, he's a pepper, you're a pepper, she's a pepper...
    ^           ^              ^                ^               ^
    0           1              2                3               4

...the first time through the loop, the search starts at offset 0. After matching the first time, the offset is 12, at position 1. After the next match, the offset is at position 2, and so on until there are no more matches.

A zero-width lookahead means a) do the match and b) if successful, put the offset back where it was before you started. So, in this example...

    $s = "Wouldn't you like to be a pepper too?";
    #     ^                               ^
    #     0                               1
    $s =~ /^(?=.*pepper).+like/;

...the regex starts at the start of the string, and the lookahead scans forward and matches pepper. But it doesn't move the offset to position 1. Instead, it leaves it at position 0, from where the rest of the pattern searches forward to find 'like'. A lookahead is tried at the current offset, so it needs the `.*` to reach a pepper further along. `$s =~ /pepper.+like/` would have failed because after matching pepper, the offset would be at position 1, and searching forward from there won't find 'like'. The code is roughly the equivalent of `$s =~ /pepper.*like|like.*pepper/`. It's more useful when parsing complex phrases, like language, where a verb, for example, can be followed by more than one type of word or phrase.

But getting back to your post:

    print "Spam!!!\n"
        if $text =~ /^To: \s* " (?!.*Furry[ ]Marmot) [^"]* " [ ] <marmot\@furrytorium\.com> /mx;

A negative lookahead is like the positive lookahead, above, but succeeds when the search term is not found. In the middle of the regex above, I want the match to succeed if there are two quotes, but they do not contain 'Furry Marmot'. Under /x, whitespace in the pattern is ignored, so each literal space is written as `[ ]`. It won't work if I try to match `"(?!.*Furry Marmot)"` because that says a) find a double-quote, b) don't find 'Furry Marmot' and then leave the offset just after the quote, and c) find the closing quote. This can only match `""`. Instead, once we have determined that Furry Marmot is not after the first quote, match zero or more of anything that isn't another quote, up to the closing quote. Now we can check what's in the email address.

This is just a simplistic example, and probably would be overkill if you tried to accommodate multiple addressees, a CC: or BCC: line, etc., but I hope it helps you learn regexes. They are one of my favorite parts of Perl. `:-)`

There's a very good description of backtracking in perldoc perlretut. That and perldoc perlre will shed a lot of light on this.

--marmot
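
The three ideas in this post (pos() with /g, a positive lookahead, and a negative lookahead) can be checked with a short script. This is only a sketch: the sub name is_spam and the sample To: lines are made up for illustration. The lookaheads are written with `.*` inside them, since a lookahead is tested at the current offset and needs the `.*` to see text further along, including a name that starts right after the opening quote.

```perl
use strict;
use warnings;

# 1. /g in scalar context: pos() is left just after each match.
my $s = "I'm a pepper, he's a pepper, you're a pepper, she's a pepper...";
my @offsets;
while ($s =~ /pepper/g) {
    push @offsets, pos($s);
}
print "pos() after each match: @offsets\n";

# 2. Positive lookahead: confirms 'pepper' is ahead without moving the
#    offset, so the rest of the pattern still starts from the beginning.
my $t = "Wouldn't you like to be a pepper too?";
my $with_lookahead    = $t =~ /^(?=.*pepper).+like/ ? 1 : 0;
my $without_lookahead = $t =~ /pepper.+like/        ? 1 : 0;
print "with lookahead: $with_lookahead, without: $without_lookahead\n";

# 3. Negative lookahead: the quoted display name must not contain
#    'Furry Marmot' for the line to count as spam.
sub is_spam {
    my ($line) = @_;
    return $line =~ /^To:\s"(?!.*Furry Marmot)[^"]*" <marmot\@furrytorium\.com>/m ? 1 : 0;
}

for my $line (
    'To: "Angie Morestead" <marmot@furrytorium.com>',
    'To: "Furry Marmot" <marmot@furrytorium.com>',
) {
    printf "%-48s %s\n", $line, is_spam($line) ? 'spam' : 'ok';
}
```

The recorded offsets are 12, 27, 44 and 60, which correspond to positions 1 through 4 in the pepper diagram. The lookahead version matches the second string while `/pepper.+like/` does not, and only the "Angie Morestead" line is reported as spam.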
Re^6: Looking for ideas on how to optimize this specialized grep by remiah (Hermit) on Jan 27, 2011 at 06:04 UTC
Re^7: Looking for ideas on how to optimize this specialized grep by furry_marmot (Pilgrim) on Jan 28, 2011 at 00:55 UTC