Getting the number of times a regexp matches

MeowChow has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Getting the number of times a regexp matches by extremely (Priest) on Dec 07, 2000 at 15:54 UTC
Benchmark time! Whoohoo! With warnings and strict off and a brain-dead match. Re-updated with t0mas' golfing and the quidity/dchetlin golf:

```perl
#!/usr/bin/perl
use Benchmark qw(cmpthese);
use vars qw($g $c $p);

$p = "b";
#$p = "[ba]";
#$p = "c";
$g = "abababababababababababababababababababababababababababababa";

cmpthese (-10, {
    'map'   => '$c = scalar map {1} ($g =~ m/$p/g);',
    'array' => '$c = scalar @{[$g =~ m/$p/g]};',
    's//'   => '$c = ($g =~ s/($p)/$1/g);',
    'while' => '$c++ while ($g =~ m/$p/g);',
    'split' => '$c= (scalar split /$p/,$g) +($g=~/$p$/)-1;',
    '@_'    => '@_=($g =~ m/$p/g) and $c=1+$#_;',
    '()'    => '$c=()=$g=~/$p/g;',
});
```

```
        Rate   s//    @_   map array    () while split
s//   1028/s    --  -38%  -48%  -48%  -54%  -68%  -77%
@_    1653/s   61%    --  -16%  -17%  -26%  -49%  -63%
map   1971/s   92%   19%    --   -1%  -12%  -39%  -56%
array 1995/s   94%   21%    1%    --  -11%  -38%  -56%
()    2240/s  118%   35%   14%   12%    --  -31%  -50%
while 3243/s  215%   96%   65%   63%   45%    --  -28%
split 4486/s  336%  171%  128%  125%  100%   38%    --
```

Ok, the split solution kicks out deprecation warnings with -w, and abusing it is crappy anyway, BUT oh mama is it fast on my machine. Personally, I'd recommend the while solution as safe and clean. It's a shame I can't ++ mirod twice!

Oh yeah, if the match fails (match "c") the results are a little different:

```
         Rate split array   s// while   map    @_    ()
split 14203/s    --  -43%  -63%  -71%  -72%  -74%  -76%
array 24826/s   75%    --  -36%  -48%  -52%  -55%  -59%
s//   38899/s  174%   57%    --  -19%  -24%  -30%  -35%
while 48184/s  239%   94%   24%    --   -6%  -13%  -20%
map   51481/s  262%  107%   32%    7%    --   -7%  -15%
@_    55225/s  289%  122%   42%   15%    7%    --   -8%
()    60282/s  324%  143%   55%   25%   17%    9%    --
```

while is warning-safe, fast, and has a cheap setup in mismatch cases.
Plus, the more complex the match, the worse split will get. Matching against [ab], for example:

```
        Rate   s//    @_   map array    () split while
s//    657/s    --  -10%  -22%  -25%  -27%  -43%  -47%
@_     729/s   11%    --  -14%  -17%  -19%  -36%  -42%
map    846/s   29%   16%    --   -4%   -6%  -26%  -32%
array  877/s   33%   20%    4%    --   -2%  -24%  -30%
()     898/s   37%   23%    6%    2%    --  -22%  -28%
split 1148/s   75%   57%   36%   31%   28%    --   -8%
while 1249/s   90%   71%   48%   42%   39%    9%    --
```

Updated. I still think the cleanest of the bunch is the while variation, and it is surely showing its colors by ranking near the top in all the variations. The ()s and array hacks stay right in there, though, and both are clear and/or simple as well.

As a final test, I ran the match 'c\|\d+\|ab' against my /var/log/lastlog (300KB), and this is what I got:

```
      s/iter   s// while split    () array   map    @_
s//     1.14    --   -4%   -4%   -5%   -5%   -5%   -6%
while   1.10    4%    --   -0%   -1%   -1%   -2%   -3%
split   1.10    4%    0%    --   -1%   -1%   -1%   -3%
()      1.09    5%    1%    1%    --   -0%   -0%   -2%
array   1.09    5%    1%    1%    0%    --   -0%   -2%
map     1.08    5%    2%    1%    0%    0%    --   -1%
@_      1.07    7%    3%    3%    2%    2%    1%    --
```

*snort* I'll let you draw your own conclusions.

$you = new YOU;
honk() if $you->love(perl)

p.s. this post, my 321st, made me a bishop =)
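Before comparing speeds, it may be worth checking that the idioms agree on the answer. The following is a minimal sketch, not part of the original benchmark: `count_all` is a hypothetical helper, the `@_` variant is rewritten with a named array so it can live inside a sub, and the split variant uses the negative-limit form discussed elsewhere in this thread. It assumes a non-empty string and a pattern without capture groups.

```perl
use strict;
use warnings;

# count_all is a hypothetical helper, not from the thread: it runs each
# counting idiom on the same string and pattern and returns the results.
sub count_all {
    my ($str, $pattern) = @_;
    my %n;

    # List-context m//g returns every match; count the resulting list.
    $n{'map'}   = scalar map { 1 } ($str =~ m/$pattern/g);
    $n{'array'} = scalar @{[ $str =~ m/$pattern/g ]};
    $n{'()'}    = () = $str =~ m/$pattern/g;

    # Same idea as the @_ variant, with a named array instead of @_.
    my @matches = ($str =~ m/$pattern/g);
    $n{'list'}  = 1 + $#matches;

    # s///g returns the number of substitutions, or "" when there are none.
    my $copy = $str;
    $n{'s//'} = ($copy =~ s/($pattern)/$1/g) || 0;

    # Scalar-context m//g advances pos() one match at a time.
    my $c = 0;
    $c++ while $str =~ m/$pattern/g;
    $n{'while'} = $c;

    # split with a negative limit keeps trailing empty fields
    # (assumes $str is non-empty and $pattern has no capture groups).
    my @fields = split /$pattern/, $str, -1;
    $n{'split'} = @fields - 1;

    return \%n;
}

for my $pattern ('b', '[ab]', 'c') {
    my $n = count_all('ababbb', $pattern);
    print "$pattern: ", join(', ', map { "$_=$n->{$_}" } sort keys %$n), "\n";
}
```

On the string 'ababbb', every idiom should report 4 for 'b', 6 for '[ab]', and 0 for 'c'.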
(TMTOWTDI) Re (2): Getting the number of times a regexp matches by mwp (Hermit) on Dec 07, 2000 at 17:23 UTC
Darn. I was proud of the `@{[]}` trick I hacked together for this problem; too bad it scored so poorly. Ah well, thanks for the benchmarks, extremely, and congrats on the promotion. {g}

This kind of reminds me of a show on A&E I caught a few minutes of the other day. It had Jeremy Irons in it, and he was trying to rebuild an old clock, either from an old schematic or a model, I'm not sure which. At any rate, it was one of the first shipboard clocks, one built to counteract the effect the swaying deck had on the pendulum. At one point he becomes irate, saying "...it's a terrible mess, layer and layer of complexity, one piece correcting for the last. The man absolutely refused to admit he was wrong and come up with other concepts." Or something to that effect. =)

I just thought that fit in nicely with this. Presented with a problem and the current behavior (m//g returns a list of matched values in list context), I used the ol' hammer-and-nail routine. It seemed to work well enough and made absolute sense to me. But some other folks went back to the root of the problem and came up with completely different solutions that worked from an oblique angle. Look at mirod's solution, for example, something I would never have even thought of. Amazing. The nature of Perl, I suppose...

'kaboo
Re: Getting the number of times a regexp matches by mirod (Canon) on Dec 07, 2000 at 14:42 UTC
Here are some more solutions: `$count++ while ($str=~ /$pattern/g); # simple`

or

```perl
$count= (scalar split /$pattern/, $str )
        + ($str=~/$pattern$/)  # or it will not be counted
        - 1;                   # so it's simpler
```
Re: Re: Getting the number of times a regexp matches by chipmunk (Parson) on Dec 07, 2000 at 19:07 UTC
The split solution is not correct, because it does not account for multiple occurrences of $pattern at the end of the string:

```perl
$str = 'ababbb';
$pattern = 'b';
$count = (scalar split /$pattern/, $str )
         + ($str=~/$pattern$/)
         - 1;
print "$count\n";
```

This prints `2`, although the string contains four matches.

Fortunately, this is an easy problem to fix: `$count = (scalar split /$pattern/, $str, -1) - 1;`

The third argument to split specifies the maximum number of pieces to split the string into. A negative number turns off the stripping of null fields from the end of the list, without limiting the number of pieces.
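To make the effect of the limit argument concrete, here is a small sketch that splits the same 'ababbb' string with and without the negative limit. It stores the fields in arrays rather than calling split in scalar context, which is an assumption made here to sidestep the deprecation warning older perls emit for the scalar form.

```perl
use strict;
use warnings;

my $str = 'ababbb';

# Default behaviour: trailing empty fields are stripped.
my @default = split /b/, $str;        # ('a', 'a')

# Negative limit: trailing empty fields are kept.
my @keep    = split /b/, $str, -1;    # ('a', 'a', '', '', '')

printf "default: %d fields, with -1: %d fields\n",
    scalar @default, scalar @keep;
printf "number of 'b' matches: %d\n", @keep - 1;

# Caveat: splitting an empty string yields zero fields whatever the limit,
# so the "fields minus one" formula would give -1 there.
my @none = split /b/, '', -1;
printf "empty string: %d fields\n", scalar @none;
```

With the fields kept, the count is simply the number of fields minus one. The empty-string case is the one input where that formula needs a guard.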
Re: Re: Getting the number of times a regexp matches by t0mas (Priest) on Dec 07, 2000 at 15:41 UTC
I don't think that the second one will be correct for strings like "blue blue blue and blue again" and patterns like "^blue"...

Update: Sorry mirod, I misread that one :)

Let's all have a try at it: `@_=($str =~ m/$pattern/g) and $count=1+$#_;`

/brother t0mas
Re: Re: Re: Getting the number of times a regexp matches by mirod (Canon) on Dec 07, 2000 at 15:47 UTC
Actually it is: `$count` gets set to 1.
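The "^blue" case can be worked through directly. This sketch applies the split formula and the list-assignment idiom to the string from this exchange. The variable names and the named array standing in for `@_` are choices made here, not taken from the posts.

```perl
use strict;
use warnings;

my $str     = 'blue blue blue and blue again';
my $pattern = '^blue';

# The split formula: a match at the start yields a leading empty field,
# so the fields are ('', ' blue blue and blue again').
my @fields      = split /$pattern/, $str;
my $split_count = @fields + ($str =~ /$pattern$/ ? 1 : 0) - 1;

# The list-assignment idiom, with a named array standing in for @_.
my ($list_count, @m);
@m = ($str =~ m/$pattern/g) and $list_count = 1 + $#m;

print "split: $split_count, list: $list_count\n";

# When nothing matches, the list assignment is false, the `and`
# short-circuits, and the count variable is never assigned.
my $unset;
@m = ($str =~ m/red/g) and $unset = 1 + $#m;
print defined $unset ? "assigned\n" : "still undef\n";
```

Both approaches give 1 here. The last lines show a wrinkle of the `and` form: on a failed match the count is left untouched rather than set to 0, so it may be worth initialising it first.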
Re: Getting the number of times a regexp matches by quidity (Pilgrim) on Dec 07, 2000 at 17:59 UTC
Everyone so far seems to have missed this bit of evil context bashing: `$num_matches = () = $string =~ m/pattern/g;`

This works because the () force the far right-hand side to be evaluated in list context, the result of which is then reevaluated in scalar context to give the result. This is a nice example of using side effects to good effect in Perl.
(tye)Re: Getting the number of times a regexp matches by tye (Sage) on Dec 07, 2000 at 22:39 UTC
Right answer, but I don't approve of your explanation. The reason that you get a count is because a list assignment in a scalar context returns the number of elements on its right-hand side. There are a ton of other "operations that would return a list if used in a list context" [often sloppily (: referred to simply as "lists" ] that would return different information if used in a scalar context.

- tye (but my friends call me "Tye")
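The distinction can be seen by putting the same operation in plain scalar context and then behind a list assignment. This is a small sketch with made-up data: the string 'abcabcabc' and the `reverse` example are illustrations chosen here, not from the thread.

```perl
use strict;
use warnings;

my $string = 'abcabcabc';

# List assignment in scalar context: the number of right-hand-side elements.
my $count  = () = $string =~ m/b/g;

# Plain scalar context: m//g only reports whether one more match was found.
my $scalar = $string =~ m/b/g;

print "count: $count, scalar m//g: $scalar\n";

# Another operation that answers differently depending on context:
# in scalar context reverse() concatenates and reverses the string,
# while the list assignment still just counts what the list form returns.
my $as_scalar = reverse('ab', 'cd');
my $as_count  = () = reverse('ab', 'cd');

print "scalar reverse: $as_scalar, element count: $as_count\n";
```

The count comes from the assignment operator itself, which is why the empty list on the left is enough.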
Re: Re: Getting the number of times a regexp matches by japhy (Canon) on Dec 07, 2000 at 18:30 UTC
But beware of large list assignments.

`japhy` -- Perl and Regex Hacker
Re: Re: Getting the number of times a regexp matches by MeowChow (Vicar) on Dec 07, 2000 at 23:53 UTC
This is an interesting construct, but I don't understand why Perl permits a constant, in this case the empty list `()`, to be used on the LHS of an assignment. For example, the following code spits out an error (`Can't modify constant item in list assignment`): `(1) = (1,2,3);`

So why is the following legal? `() = (1,2,3);`

Why is the empty list not treated as a constant?
(tye)Re2: Getting the number of times a regexp matches by tye (Sage) on Dec 08, 2000 at 00:39 UTC
First, because (1) contains a constant (the "1" part) and () contains no constants. So there are no constants being modified, so why give an error complaining about you trying to modify no constants?

Second, as Perl is implemented, I detect a clear preference toward not disallowing things even if the implementor can't think of a good use for that thing at the time. This makes sense for a TIMTOWTDI language.

Third, we've just demonstrated a use for it. So it is a good thing it wasn't disallowed just because the use wasn't obvious at the time.

I suspect that this working was at least partially an accident. The list assignment code was written and tested and it worked. I doubt anyone tested this degenerate list assignment. In fact, searching the standard Perl test suite, I find that this feature is not tested, but it is used when testing another feature:

```perl
# Should use magical autoinc only when both are strings
print "not " unless 0 == (() = "0"..-1);
print "ok 14\n";

for my $x ("0"..-1) {
    print "not ";
}
print "ok 15\n";
```

So there! q-:

- tye (but my friends call me "Tye")
Re: Re: Re: Getting the number of times a regexp matches by Fastolfe (Vicar) on Dec 08, 2000 at 00:47 UTC
I suspect it has something to do with the way these things do work:

```perl
($a, $b, $c) = (1, 2, 3);
($a, $b) = (1, 2, 3);                 # 3 discarded
($a, undef, $c) = (1, 2, 3);          # 2 discarded
($a, undef) = (1, 2, 3);              # 2 and 3 discarded
(undef, undef, undef) = (1, 2, 3);    # 1 2 and 3 discarded
() = (1, 2, 3);                       # functionally equivalent
```

"undef" is the only real non-variable value you can use on the left-hand side like that.
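One way to see that the left-hand side does not affect the count is to evaluate each of these assignments in scalar context. This is a sketch under assumptions: it uses `$x`, `$y`, `$z` rather than `$a`/`$b` to stay clear of the sort variables, and it wraps the rejected `(1) = ...` form in a string eval so the script still compiles.

```perl
use strict;
use warnings;

my ($x, $y, $z);

# Each list assignment, forced into scalar context, reports the number
# of right-hand-side elements, however many the left side keeps.
my @counts = (
    scalar(($x, $y, $z)  = (1, 2, 3)),
    scalar(($x, $y)      = (1, 2, 3)),
    scalar(($x, undef)   = (1, 2, 3)),
    scalar(()            = (1, 2, 3)),
);
print "@counts\n";

# A literal constant on the left is rejected at compile time, so the
# statement is compiled inside a string eval to capture the message.
my $ok  = eval '(1) = (1, 2, 3); 1';
my $err = $@;
print "rejected: $err" unless $ok;
```

All four counts come out as 3, while the `(1)` form fails with the "Can't modify constant item in list assignment" error quoted earlier in the thread.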
Re: Getting the number of times a regexp matches by dchetlin (Friar) on Dec 07, 2000 at 18:33 UTC
No one mentioned the canonical and most Perlish way to do this: `my $count = () = $str =~ m/pattern/g;`

I suspect it offers little benefit in terms of efficiency, and can be confusing contextually. I do feel that it's the most succinct and elegant, however.

Update: My apologies; while I was writing this and getting distracted, it was mentioned ahead of me, and an interesting discussion of it followed. Ah well.

-dlc