sgifford has asked for the wisdom of the Perl Monks concerning the following question:

I was writing up an example of using mmap last night in Re: anyone have some good mmap examples?, and was surprised at how poorly my first version performed. It essentially mmaped in a fairly large file, then tried to match lines containing a pattern with:
while ($mmap =~ m/^([^\n]*$pat[^\n]\n)/omg) { print $1; }

I found performance was abysmal: 20 seconds to search a 400K file (/usr/share/dict/words) for "foo". Some investigation (and perl -Dr) revealed that the regex engine was finding "foo" in the string very quickly, then was trying every single line preceding it to see if it matched against the remaining regex. I rewrote my simple pattern to:

while ($mmap =~ m/$pat/omg) { my $pos = pos $mmap; # Find the beginning and end of this line my $first = rindex($mmap,"\n",$pos)+1; my $last = index($mmap,"\n",$pos)+1; print substr($mmap,$first,$last-$first); }
and runtime went from 20 seconds to .040 seconds.

My question is: is there anything I could have told Perl to give it a hint about how to perform the match, such as "Hey Perl, the ^ is going to be pretty close to $pat, so just search backwards from wherever you find it"?

Here's a full copy of the original program.

#!/usr/bin/perl use warnings; use strict; use Sys::Mmap; our $pat = shift; foreach my $a (@ARGV) { my $mmap; if (! -f $a) { warn "'$a' is not a regular file, skipping\n"; next; } open(F, "< $a") or die "Couldn't open '$a': $!\n"; mmap($mmap, 0, PROT_READ, MAP_SHARED, F) or die "mmap error: $!\n"; while ($mmap =~ m/^([^\n]*$pat[^\n]\n)/omg) { print $1; } munmap($mmap) or die "mmunmap error: $!\n"; close(F) or die "Couldn't close '$a': $!\n"; }

Replies are listed 'Best First'.
Re: Surprisingly poor regex performance
by Joost (Canon) on Dec 13, 2004 at 16:52 UTC
      What about m/(?:\A|\n)(.*$pat.*\n)/omg?

      Being right, does not endow the right to be rude; politeness costs nothing.
      Being unknowing, is not the same as being stupid.
      Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
      Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

        Good stuff.. It's faster :-)

        #!/usr/local/bin/perl -w use strict; use Benchmark; my $pat = 'foo'; $/=undef; open F,"<","/usr/share/dict/words"; my $mmap = <F>; close F; timethese( 100, { # 'orig' => sub { # while ($mmap =~ m/^([^\n]*$pat[^\n]\n)/omg) { # warn $1; # } # }, 'fast' => sub { while ($mmap =~ m/\n?(.*$pat.*\n)/omg) { warn $1; } }, 'better' => sub { while ($mmap =~ m/(?:\A|\n)(.*$pat.*\n)/omg) { warn $1; } }, } ) __END__ Benchmark: timing 100 iterations of better, fast... better: 26 wallclock secs (23.13 usr + 0.06 sys = 23.19 CPU) @ 4 +.31/s (n=100) fast: 41 wallclock secs (39.73 usr + 0.09 sys = 39.82 CPU) @ 2 +.51/s (n=100)
        (original takes about 25 seconds per iteration on my machine)

      Ah, thanks, that makes a big difference (though it's still slower than using index/rindex). Do you have any explanation for why? My understanding was that ^ meant pretty much the same thing as \n?.

      dragonchild's code is even faster, in some cases by a factor of 3.

        Well, ^ doesn't mean the same as \n? at all. Like not even close.

        According to Programming Perl 3rd edition pages 150 and 159, ^, when used with the /m modifier, means to match after embedded newlines or the beginning of the string.

        \n? means something akin to possibly match a newline, but maybe not. That doesn't provide the engine with any understanding of where to match. Had you said something like /\n(.*$pat.*\n)/, you would have been better off (except for not matching the first line). Of course, with /m, you should be able to do /^(.*$pat.*)$/omsg and it should be about as efficient as my regex.

        With regexes, it's always better to be as explicit as possible. This will allow the engine to make a number of optimizations. Some of those optimizations, as you have found out, can mean the difference between 2500 seconds and 23 seconds.

        Being right, does not endow the right to be rude; politeness costs nothing.
        Being unknowing, is not the same as being stupid.
        Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
        Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

        Good question. I guess using ^ to match newlines is probably just not very well optimized in the engine.

        The usual trick to make regexes faster is to be as explicit as you can be (like spelling out the newlines) and it helped here. I don't know enough about perl's regex internals to really explain why it works as well as it does in this case.

Re: Surprisingly poor regex performance
by ikegami (Patriarch) on Dec 13, 2004 at 16:47 UTC

    Shouldn't
    [^\n]\n
    be
    [^\n]*\n
    I don't know if that'll help your performance problem.

    This post may not have anything to do with mmap. Did you try it without using mmap?

      Thanks; I tried several versions, and at some point lost my *. Putting it back in didn't help performance at all, though. And I agree it has nothing at all to do with mmap; that's just a convenient way of getting a 400K scalar variable with interesting contents.
Re: Surprisingly poor regex performance
by tall_man (Parson) on Dec 14, 2004 at 18:24 UTC
    There is no need to do all this anchoring. ".*" does not match newlines (unless you use the /s option), so the following will also work (and should be just as fast):
    my $re = qr/(.*$pat.*)/; while ($mmap =~ m/$re/g) { print $1,"\n"; }
      You're right, that is roughly as fast as /(?:\A|\n)(.*$pat.*\n)/omg. But it's still slower than index/rindex.
        You were asking why the index/rindex is so good. It's because you're matching a pure literal string, which can use the highly-efficient Boyer-Moore search (the longer the literal string, the faster it tends to be).

        For some reason, the "^" is not being optimized properly as an anchor for "/mg" searches. Leaving it out entirely, or substituting the alternate "(?:\A|\n)" pattern, results in searches that are much faster. That seems like a bug in the optimizer.

Re: Surprisingly poor regex performance
by dws (Chancellor) on Dec 14, 2004 at 08:11 UTC

    My question is: is there anything I could have told Perl to give it a hint about how to perform the match...?

    I'm wondering if a divide-and-conquer approach might give you the same effect, while being clearer to Perl about how to match without needing to backtrack.

    while ( $mmap =~ /^(.*\n)/mg ) { my $line = $1; print $line if $line =~ /$pat/; }

Re: Surprisingly poor regex performance
by dragonchild (Archbishop) on Dec 13, 2004 at 16:52 UTC
    Well, are the two identical? From my reading, I don't think they are. The rewritten version finds "foo", then looks forward and backward to the first newline in both directions. The original version is looking for the beginning of a line after $pat ... ? That doesn't look right ...

    Update: It helps to have more than one cup of coffee on Monday mornings ... *sighs*

    /me really hates Mondays ...

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

Would study help this regexp's performance?
by Smylers (Pilgrim) on Dec 15, 2004 at 10:04 UTC

    Would using study help here?

    Note this is only a question, not a recommendation — I've never used study, but from its docs this sounds like a situation where it might help.

    Smylers

    Retitled by davido.
    Re-retitled by Smylers, so that it's grammatically correct again (and I don't look like an idiot who doesn't know to use apostrophes).

      study only really helps (ie. saves time) if you are going to be searching the studied string multiple times to offset the cost of the studying itself. And then only if your search term contains one or more characters that have rare occurance in the studied string.

      I've wondered whether study could be updated to take a parameter n, where it then builds the table from groups of n chars, triplets being more unique than doublets, and they more so than individual chars.

      Of course, the regex engine would need updating to make use of the information, and that's a very scary task to comtemplate.


      Examine what is said, not who speaks.        The end of an era!
      "But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
      "Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo         "Efficiency is intelligent laziness." -David Dunham
      "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
        study only really helps (ie. saves time) if you are going to be searching the studied string multiple times to offset the cost of the studying itself

        And in this case, I'm searching the string only once, so it almost certainly would have hurt rather than helped.