Regex, capturing variables vs. speed

albert has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Regex, capturing variables vs. speed by Zaxo (Archbishop) on Oct 29, 2005 at 21:28 UTC
The three use simpler regexen, so they run faster. A particular drain on your combined regex is the "`.`" (dot-star) you use to join the interesting ones. Each will cause the regex to match to the end of the line, then backtrack to find the following fixed text. "`.?`" would be a faster joining regex. See Ovid's Death to Dot Star! After Compline, Zaxo	[reply]
Re^2: Regex, capturing variables vs. speed by sauoq (Abbot) on Oct 30, 2005 at 21:46 UTC
A particular drain on your combined regex is the "." (dot-star) you use to join the interesting ones.* I wouldn't automatically point fingers at the `.` even though, as you say, `.?` would probably be better for him. The issue is how his use of `.` is combined with his use of `.?` when what he really means is `\S`. He'd probably see a significant speedup even if switching to `\S` was his only change, but incorporating your suggestion makes sense as well. So, I'd rewrite `my ($chr,$prot,$panel) = /(chr.?)\s.urn:lsid:(.?)\s.panel:(.?):/i;` as `my ($chr, $prot, $panel) = /(chr\S)\s.?urn:lsid(\S)\s.?panel:([^:] +):/i;` [download] (Note that changing the last `.?` to `[^:]` is a good idea too even if the efficiency gain isn't much. Update: Well, reading further down the thread, it seems robin beat me to it. -sauoq "My two cents aren't worth a dime.";	[reply] [d/l] [select]
Re: Regex, capturing variables vs. speed by Ovid (Cardinal) on Oct 29, 2005 at 21:33 UTC
Without seeing the text you're trying to match, my guess is that the "dot star" is causing backtracking issues. See Death to Dot Star! for more information. Update: Heh :) Zaxo beat me to it. Cheers, Ovid New address of my CGI Course.	[reply]
Re: Regex, capturing variables vs. speed by GrandFather (Saint) on Oct 30, 2005 at 10:33 UTC
Following on from the previous replies here is a benchmark demonstrating the performance difference. Note though that with the test string the speed difference is only of the order of two times, not the 10 times described by OP. use warnings; use strict; use Benchmark qw(cmpthese); my $target = 'This is a string used to test the time required for a gr +eedy match compared to a non-greedy match.'; my $greedy = qr/(\ba\b.\bstring\b)/; my $non = qr/(\ba\b.?\bstring\b)/; my ($matchG) = $target =~ $greedy; my ($matchN) = $target =~ $non; die "Matches generate different results\n" if $matchG ne $matchN; cmpthese ( -1, { 'Greedy' => sub {$target =~ $greedy;}, 'Non' => sub {$target =~ $non;} } ); Prints: Rate Greedy Non Greedy 162689/s -- -64% Non 456847/s 181% -- [download] Perl is Huffman encoded by design.	[reply] [d/l]
Re^2: Regex, capturing variables vs. speed by albert (Monk) on Oct 30, 2005 at 17:57 UTC
Thanks to all for feedback. I did get much more than 2x speed increase for greedy vs. not because my line to match is quite long. Taking what I've learned from the thread, I did some comparisons. use Benchmark qw/cmpthese/; my $line = 'rs11502186 C/G Chr11 170472 + ncbi_b34 perlegen urn:lsid:p +erlegen.hapmap.org:Protocol:Genotyping_1.0.0:2 urn:lsid:perlegen.hapm +ap.org:Assay:25763.7541533:1 urn:lsid:dcc.hapmap.org:Panel:CEPH-30-tr +ios:1 QC+ GG GG GG NN GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG + GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG NN + GG GG GG NN GG GG GG GG GG GG GG GG GG GG GG GG GG NN GG GG GG GG NN + GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG + GG GG GG GG GG GG'; cmpthese(-1, { 'Greedy' => sub {$line =~ /(chr.?)\s.urn:lsid:(.?)\s.p +anel:(.?):/i}, 'Non' => sub {$line =~ /(chr.?)\s.?urn:lsid:(.?)\s.?pa +nel:(.?):/i}, 'Sep' => sub {$line =~ /(chr.?)\s/i; $line =~ /urn:lsid:(.?)\s/i; $line =~ /panel:(.*?):/i; }, 'Death_star' => sub {$line =~ /(chr[^\s]+)/i; $line =~ /urn:lsid:([^\s]+)/i; $line =~ /panel:([^:]+)/i; } } ); [download] Giving the following results: `Rate Greedy Sep Non Death_star Greedy 8650/s -- -95% -95% -97% Sep 157827/s 1725% -- -6% -41% Non 167020/s 1831% 6% -- -38% Death_star 267963/s 2998% 70% 60% --` [download] Killing the star is clearly the way to go. Thanks to the Monks which helped me learn something. -albert	[reply] [d/l] [select]
Re^3: Regex, capturing variables vs. speed by robin (Chaplain) on Oct 30, 2005 at 20:31 UTC
Even faster is to use a single match, but to be explicit about what you're looking for, i.e. `'Better' => sub { $line =~ /(chr\S).?urn:lsid:(\S).?panel:([^:])/i }` [download] It's always better to write `(\S)\s` than `(.*?)\s`, because you're making it clear to the matching engine exactly what you're looking for (non-space characters in this case).	[reply] [d/l]
Re^4: Regex, capturing variables vs. speed by albert (Monk) on Oct 30, 2005 at 21:06 UTC
Re^3: Regex, capturing variables vs. speed by GrandFather (Saint) on Oct 30, 2005 at 23:42 UTC
Interesting, there must be Perl differences too. The spread is not as great with Active State Perl v5.8.7. In particular, Better is not as much better. `Rate Greedy Sep Death_star Non Be +tter Greedy 9965/s -- -90% -94% -94% +-97% Sep 102700/s 931% -- -34% -37% +-66% Death_star 155342/s 1459% 51% -- -5% +-48% Non 164099/s 1547% 60% 6% -- +-46% Better 301485/s 2925% 194% 94% 84% + --` [download] The results above used OP's benchmark code (with the addition of Better). Perl is Huffman encoded by design.	[reply] [d/l]
Re: Regex, capturing variables vs. speed by GrandFather (Saint) on Oct 30, 2005 at 10:55 UTC
There must be something unusual about OP's data. A simple test gives very similar results for the two techniques: use warnings; use strict; use Benchmark qw(cmpthese); my $target = 'Some leading text' . 'chrMatchTextForChr ' . 'Some text to be ignored following chr ' . 'urn:lsid:UrnLisdMatchText ' . 'Text to be ignored after urn:lsid match text' . 'panel:Text: Some text that doesn\'t get matched by anything'; die "Different results" if all () ne parts (); cmpthese (-1, {'all' => \&all, 'parts' => \&parts}); sub all { my ($chr,$prot,$panel) = $target =~ /(chr.?)\s.urn:lsid:(.?)\s.pan +el:(.?):/i; return "$chr$prot$panel"; } sub parts { my ($chr) = $target =~ /(chr.?)\s/i; my ($prot) = $target =~ /urn:lsid:(.?)\s/i; my ($panel) = $target =~ /panel:(.?):/i; return "$chr$prot$panel"; } Prints: Rate parts all parts 82116/s -- -17% all 98443/s 20% -- [download] Perl is Huffman encoded by design.	[reply] [d/l]
Re^2: Regex, capturing variables vs. speed by Your Mother (Archbishop) on Oct 30, 2005 at 21:25 UTC
Maybe the version of perl, and the regex engine, being used by the OP differs?	[reply]
Re: Regex, capturing variables vs. speed by Anonymous Monk on Oct 30, 2005 at 14:07 UTC
Are you sure you're not secretly parsing XML using regex?	[reply]
Re^2: Regex, capturing variables vs. speed by GrandFather (Saint) on Oct 30, 2005 at 19:24 UTC
You mean that regex "know" when one is parsing XML and deliberatly and maliciously slow down as a result? That's pretty mean! Perl should at least generate a warning in that situation. Perl is Huffman encoded by design.	[reply]
Re: Regex, capturing variables vs. speed by Hexren (Initiate) on Oct 31, 2005 at 01:07 UTC
Although I am far too lazy to check with my oreilly right now I would hazard a guess that "." is the crucial point. Matching something like "chr.?\s" should be fairly easy. Discard anything that does not look like "chr" if you find something that looks like it. See if the next thing is a \s else get whatever it is and look if the thing after that thing is a \s. In comparison looking for "." is more expensive as that . tries to grap as much as possible and it has to play through all the possible combinations to see if it cann match something more. It could be summarized like "Matching greedy things is more expensive than matching non greedy things" Is that understandable (I do hope it is right somewhat ;) ? Regards Hexren	[reply]
Re: Regex, capturing variables vs. speed by Aristotle (Chancellor) on Nov 01, 2005 at 21:02 UTC
Here’s a quick tip: if you do a simple `perl -Mre=debug -le'"some test data here" =~ /your regex here/'` [download] on the command line, you will see a) how the regex engine compiles your pattern, including which optimisations it discovered to be applicable while compiling, and b) which steps it performs during actual matching, including which optimisations actually made a difference. (F.ex., seeing “Match rejected by optimizer” for input you want to reject is ideal, because it means the match failure was detected even before the engine was powered up.) This is an invaluable aid in understanding the performance characteristics of different patterns, as you can see just what the engine is really doing. Makeshifts last the longest.	[reply] [d/l]
Re^2: Regex, capturing variables vs. speed by monarch (Priest) on Nov 02, 2005 at 03:49 UTC
That's a great tool, and here's something I discovered by playing around. There's a difference between executing: `perl -Mre=debug -le'"lahblahblahblah" =~ /(.?lah)$1{2}/'` [download] and `perl -Mre=debug -le'"lahblahblahblah" =~ /(.?lah)\1{2}/'` [download] The first one matches on "lahblahblah", and the second one matches on "blahblahblah". Why do you suppose that is? The debugging engine appears to see the `$1` as an almost literal copy of the regexp in the brackets. Whereas the `\1` seems to be looking for whatever was matched by the regexp in the brackets.. Update: Thanks to Aristotle for pointing out the error in the `$1` interpretation.	[reply] [d/l] [select]
Re^3: Regex, capturing variables vs. speed by Aristotle (Chancellor) on Nov 02, 2005 at 04:39 UTC
The regex engine is not seeing any `$1`. A `$1` in a regex does not mean anything special, it’s just a variable that gets interpolated as a literal string. In your case, `$1` is empty when that pattern is compiled, so `/(.?lah)$1{2}/` becomes simply `/(.?lah){2}/`. You can see this if you read the compiler output carefully – the `CURLYX[0] {2,2}` applies to the parenthesised expression. Makeshifts last the longest.	[reply]
Re: Regex, capturing variables vs. speed by demerphq (Chancellor) on Nov 01, 2005 at 14:13 UTC
I can suggest a rationale for why the seperate regexes might be faster. The seperate regexes only need to use Boyer-Moore matching to find the start position for the constant strings in the patter, and then capture for the appropriate amount following that point. With the unified regex it will only find one of the constant strings (iirc, it will be 'chr') and then needs to match right from that point. Other changes to pattern will play a role in the various runtimes so the above is not the only analysis required, but it does give some insight into how the regex engine will handle your pattern. --- $world=~s/war/peace/g	[reply]