grinder has asked for the wisdom of the Perl Monks concerning the following question:

I am pretty hopeless at regexps; my eyes glaze over at anything more complicated than (?:foo). That said, I have a performance problem. I'm looking at several gigabytes of log files and I want to extract "interesting" lines. Stripping the problem down to the absolute barebones, the problem looks like this:

#! /usr/bin/perl -w my $target = shift || 'word'; my $re = qr/a=<(.*?$target.*?)>/; while( <DATA> ) { print "[$1]\n" if /$re/; } __DATA__ a=<wordy> b=<rappinghood> a=<thisword> b=<thatword> a=<foreword> b=<junk> a=<nothing> b=<word> b=<wordplay> a=<swords> b=<end>

In other words, I'm interesting in a= lines that contain word (or whatever I specify) anywhere between angle brackets. If I do see it, then I want everything between the angle brackets. (And in other variants I might be interesting in a= and b=, but that's another story). The above program emits:

[wordy] [thisword] [foreword] [swords]

I'm using .*? to elide the remaining characters between the my target and delimiters, however, when I look at the behaviour of the match engine (with use re 'debug') I see lots of backtracking.

My question is this: is it possible to write a regexp for this problem that does not involve backtracking? The other question is whether it improves performance to any significant degree...

The corollary is that, hopefully, given I understand the problem domain, the solution might help me understand regexps better, and possibly be able to transpose it to other problems in the future.

Thanks.

Update: (meta-note: hooray for being able to update SoPWs!) Duh! Seconds after creating this node, it of course dawns on me that I can write

my $re = qr/a=<(.*?$target[^>]*)>/;

This is probably half the issue... but what about the leading characters between the opening delimiter and the target?

Replies are listed 'Best First'.
Re: Removing backtracking from a .*? regexp
by Art_XIV (Hermit) on Nov 17, 2003 at 20:36 UTC

    How about the following?

    use strict; my $target = shift || 'word'; my $re = qr/a=<(.*?$target.*?)>/; while( <DATA> ) { if (index($_, $target) >= 0 && /$re/) { #short-circuit regex! print "[$1]\n"; } }

    The poor, oft-ignored index function will filter out lines that don't contain 'word' at all, probably w/fewer cycles than running a regexp. Your mileage may vary, of course.

    Hanlon's Razor - "Never attribute to malice that which can be adequately explained by stupidity"
      The index() isn't really all that much faster than a regexp. BrowserUK was testing this once, I played with the benchmark and noticed that the index test was a percentage point faster. At this point I'd prefer people just recommend the regexp since it is as fast and is notationally less complex.

        Cool beans if so! Sounds like a good candidate for a Meditation.

        Hanlon's Razor - "Never attribute to malice that which can be adequately explained by stupidity"
Re: Removing backtracking from a .*? regexp
by ysth (Canon) on Nov 17, 2003 at 18:50 UTC
    You can suppress backtracking in the .*? with (?>.*?), but that always makes that part match zero characters. It really has to backtrack into the .*? to try grabbing more characters before matching $target. Does that make sense?

    (?>.*?$target) may be what you want, but I'm not sure it will make any difference once you've switched to using [^>]* in the later part.

Re: Removing backtracking from a .*? regexp
by Anonymous Monk on Nov 17, 2003 at 18:44 UTC

    Depending on the complexity of your $target and your data, you may find a gain by breaking your search up:

    my $target = shift || 'word'; $target = qr/$target/; my $re = qr/a=<([^>]+)>/; while( <DATA> ) { next unless /$re/; my $s = $1; print "[$s]\n" if $s =~ /$target/; }
Re: Removing backtracking from a .*? regexp
by Roger (Parson) on Nov 18, 2003 at 00:45 UTC
    For curiosity on the performance issue, I have conducted some experiments to compare the performance between a straight regular expression with capture and positive look-ahead with substr & length functions.

    The following is my code for the experiment -
    use strict; use Benchmark qw/timethese cmpthese/; # use re 'debug'; chomp(my @lines = <DATA>); my $target = shift || 'word'; my $re_positive_lookahead = qr/^a=<(?=.*?word)/; my $re_loose = qr/a=<(.*?$target.*?)>/; cmpthese( timethese(1000000, { 'Match_L' => '&Match_Loose', 'Match_P' => '&Match_PLAhead_plus_Substr', }) ); sub Match_Loose { foreach (@lines) { if (/$re_loose/) { my $word = $1; } } } sub Match_PLAhead_plus_Substr { foreach (@lines) { if (/$re_positive_lookahead/) { my $word = substr($_, 3, length($_)-4); } } } __DATA__ a=<swords> a=<wordy> b=<rappinghood> a=<thisword> b=<thatword> a=<foreword> b=<junk> a=<nothing> b=<word> b=<wordplay> b=<end>
    And the result of the experiment is as follows -
    Benchmark: timing 1000000 iterations of Match_L, Match_P... Match_L: 31 wclock secs (31.16 usr + 0.00 sys = 31.16 CPU) @ 32092.4 +3/s Match_P: 27 wclock secs (27.64 usr + 0.00 sys = 27.64 CPU) @ 36179.4 +5/s Rate Match_L Match_P Match_L 32092/s -- -11% Match_P 36179/s 13% --
    The observation is that positive lookahead combined with substr is about 12% faster than straight regexp with capture, which is not an insignificant speed improvement.

      The observation is that positive lookahead combined with substr is about 12% faster than straight regexp with capture, which is not an insignificant speed improvement.

      You are testing a highly sanitized version of the problem. The OP reduced it a barebones example. Surely the OP's logs do not begin with, nor even contain, the data shown, making it more complex to use substr to grab the right chunk of string. Secondly, your speed gain disappears entirely when you remove trailing backtracking from the loose regex as mentioned earlier. Try it with the this regex:

      my $re_loose = qr/a=<(.*?$target[^>]+)>/;

      Last, but most certainly not least: It is far too easy to reach unjustfied conclusions when benchmarking.

Re: Removing backtracking from a .*? regexp
by BrowserUk (Patriarch) on Nov 18, 2003 at 01:43 UTC

    Perhaps the simplest way is to take two bites at the cherry.

    if( $string =~ m[^a=<([^>]*)>] and $1 =~ m[$target] ) { # Do something }

    This is often, though not always, the quickest approach.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    Hooray!
    Wanted!

Re: Removing backtracking from a .*? regexp
by Anonymous Monk on Nov 18, 2003 at 20:47 UTC
    I think I understand what you're talking about, but I'm not sure. You see, the problem is the way you framed the question. The first line of the post suggests that you have a performance problem. In general, a performance problem is one in which systems are being overloaded or time constraints aren't being met.

    From the rest of the post however, I get the impression that this is just something you do every now and then, and that its just a little slow and maybe you have to go make a cup of coffee while it runs.

    The two are very different. If you are hunting this data regularly in a constrained environment, there are numerous techniques you can use to boost the performance of your search, including sorted character and pair indexes etc.

    On the other hand, these will take effort to implement. If you're just getting bored of waiting and want a faster regex, the one you put in the update is probably going to be about it. You won't be able to remove the backtracking entirely, in a worst case example imagine you're looking for "teatea" in the string "teateteatea". Its going to backtrack no matter what regular expression you use to pull off the match.

    The other viable alternative is to push all the data into a decent database and let it worry about it. All depends on how often the data changes and how often you do the searches.

    It's a pity that you didn't specify those factors :/