MeowChow has asked for the wisdom of the Perl Monks concerning the following question:

Hi Fellas,

I'm looking for the most succinct and elegant way to get the number of times a regexp matches, using only one statement. I don't want to use a loop, or any unnecessary variables, and as always, efficiency counts. At first, I assumed one could simply scalar a list returned by the match, but using scalar forces the regexp itself into scalar evaluation mode, so the following won't work:

$count = scalar ($str =~ m/pattern/g);
A few working solutions from the CB were:
$count = scalar map {1} ($str =~ m/pattern/g);
$count = scalar @{[$str =~ m/pattern/g]};
$count = ($str =~ s/(pattern)/$1/g);
But these seem either too unintuitive or inefficient for such a common task. Is there a better way?
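
A minimal sketch of the context issue described above. The sample string and pattern are made up for illustration and are not from the original question. In scalar context m//g only reports whether the next match succeeded, while in list context it returns every match:

```perl
use strict;
use warnings;

# Hypothetical sample data, chosen only for illustration.
my $str = "abababa";

# Scalar context: m//g finds just the next match and returns true or false.
my $flag = scalar($str =~ m/b/g);

# Reset the match position that the scalar-context m//g left behind,
# because a list-context m//g would otherwise continue from there.
pos($str) = undef;

# List context: m//g returns all matches, so the list can be counted.
my @matches = ($str =~ m/b/g);
my $count   = @matches;

print "scalar context: $flag\n";
print "list context:   $count\n";
```

The temporary array is exactly the "unnecessary variable" the question wants to avoid; it is only there to make the two contexts visible.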

Replies are listed 'Best First'.
Re: Getting the number of times a regexp matches
by extremely (Priest) on Dec 07, 2000 at 15:54 UTC
    Benchmark time! Whoohoo! With warnings and strict off and a brain dead match.

    Re-Updated with t0mas' golfing and the quidity/dchetlin golf

    #!/usr/bin/perl
    use Benchmark qw(cmpthese);
    use vars qw($g $c $p);

    $p = "b";
    #$p = "[ba]";
    #$p = "c";
    $g = "abababababababababababababababababababababababababababababa";

    cmpthese (-10, {
        'map'   => '$c = scalar map {1} ($g =~ m/$p/g);',
        'array' => '$c = scalar @{[$g =~ m/$p/g]};',
        's//'   => '$c = ($g =~ s/($p)/$1/g);',
        'while' => '$c++ while ($g =~ m/$p/g);',
        'split' => '$c= (scalar split /$p/,$g) +($g=~/$p$/)-1;',
        '@_'    => '@_=($g =~ m/$p/g) and $c=1+$#_;',
        '()'    => '$c=()=$g=~/$p/g;',
    });
            Rate   s//    @_   map array    () while split
    s//   1028/s    --  -38%  -48%  -48%  -54%  -68%  -77%
    @_    1653/s   61%    --  -16%  -17%  -26%  -49%  -63%
    map   1971/s   92%   19%    --   -1%  -12%  -39%  -56%
    array 1995/s   94%   21%    1%    --  -11%  -38%  -56%
    ()    2240/s  118%   35%   14%   12%    --  -31%  -50%
    while 3243/s  215%   96%   65%   63%   45%    --  -28%
    split 4486/s  336%  171%  128%  125%  100%   38%    --

    Ok, the split solution triggers deprecation warnings under -w, and abusing it is crappy anyway, BUT oh mama is it fast on my machine. Personally, I'd recommend the while solution as safe and clean. It's a shame I can't ++ mirod twice!
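
Since the benchmark above runs with warnings and strict off, here is a hedged sketch of how the same comparison might be set up with both enabled, using code references instead of strings. The names %counters, $re and the "bench" command-line argument are assumptions made for this example, not part of the original script. It first checks that the candidates agree on the count, and only times them on request:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data, similar in spirit to the string used above.
my $g  = ("ab" x 29) . "a";
my $re = qr/b/;

# Each candidate takes a string and a compiled pattern and returns a count.
my %counters = (
    'map'   => sub { my ($s, $p) = @_; my $c = scalar map {1} ($s =~ /$p/g); $c },
    'array' => sub { my ($s, $p) = @_; my $c = scalar @{[ $s =~ /$p/g ]}; $c },
    'while' => sub { my ($s, $p) = @_; my $c = 0; $c++ while $s =~ /$p/g; $c },
    '()'    => sub { my ($s, $p) = @_; my $c = () = $s =~ /$p/g; $c },
);

# Sanity check: every candidate should report the same number of matches.
my %result;
$result{$_} = $counters{$_}->($g, $re) for keys %counters;
print "$_: $result{$_}\n" for sort keys %result;

# Timing is optional: run the script with the argument "bench" to enable it.
if (@ARGV && $ARGV[0] eq 'bench') {
    my %timed;
    for my $name (keys %counters) {
        my $code = $counters{$name};
        $timed{$name} = sub { $code->($g, $re) };
    }
    cmpthese(-2, \%timed);
}
```

The sub call adds the same overhead to every candidate, so relative rankings may differ somewhat from the string-eval version; treat any numbers it produces as indicative only.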

    Oh yeah, if the match fails (match "c") the results are a little different:

             Rate split array   s// while   map    @_    ()
    split 14203/s    --  -43%  -63%  -71%  -72%  -74%  -76%
    array 24826/s   75%    --  -36%  -48%  -52%  -55%  -59%
    s//   38899/s  174%   57%    --  -19%  -24%  -30%  -35%
    while 48184/s  239%   94%   24%    --   -6%  -13%  -20%
    map   51481/s  262%  107%   32%    7%    --   -7%  -15%
    @_    55225/s  289%  122%   42%   15%    7%    --   -8%
    ()    60282/s  324%  143%   55%   25%   17%    9%    --

    while is warning safe, fast and has a cheap setup in mismatch cases. Plus, the more complex the match, the worse split will get:

    Matching against [ab] for example:

            Rate   s//    @_   map array    () split while
    s//    657/s    --  -10%  -22%  -25%  -27%  -43%  -47%
    @_     729/s   11%    --  -14%  -17%  -19%  -36%  -42%
    map    846/s   29%   16%    --   -4%   -6%  -26%  -32%
    array  877/s   33%   20%    4%    --   -2%  -24%  -30%
    ()     898/s   37%   23%    6%    2%    --  -22%  -28%
    split 1148/s   75%   57%   36%   31%   28%    --   -8%
    while 1249/s   90%   71%   48%   42%   39%    9%    --

    --

    Updated. I still think the cleanest of the bunch is the while variation and it is surely showing its colors in ranking up near the top in all the variations. The ()s and array hacks stay right in there tho and both are clear and/or simple as well. As a final test, I passed the match 'c|\d+|ab' against my /var/log/lastlog (300KB) and this is what I got:

          s/iter   s// while split    () array   map    @_
    s//     1.14    --   -4%   -4%   -5%   -5%   -5%   -6%
    while   1.10    4%    --   -0%   -1%   -1%   -2%   -3%
    split   1.10    4%    0%    --   -1%   -1%   -1%   -3%
    ()      1.09    5%    1%    1%    --   -0%   -0%   -2%
    array   1.09    5%    1%    1%    0%    --   -0%   -2%
    map     1.08    5%    2%    1%    0%    0%    --   -1%
    @_      1.07    7%    3%    3%    2%    2%    1%    --

    *snort* I'll let you draw your own conclusions.

    $you = new YOU;
    honk() if $you->love(perl)

    p.s. this post, my 321st, made me a bishop =)

      Darn. I was proud of the @{[]} trick I hacked together for this problem, too bad it scored so poorly. Ah well, thanks for the benchmarks extremely, and congrats on the promotion. {g}

      This kind of reminds me of a show on A&E I caught a few minutes of the other day. It had Jeremy Irons in it and he was trying to rebuild an old clock, either from an old schematic or model, I'm not sure which. At any rate, it was one of the first shipboard clocks, one to counteract the effect the swaying deck had on the pendulum. At one point, he becomes irate, saying "...it's a terrible mess, layer and layer of complexity, one piece correcting for the last. The man absolutely refused to admit he was wrong and come up with other concepts." Or something to that effect. =)

      I just thought that fit nicely in with this. Presented with a problem and current behavior (m//g returns a list of matched values in list context) I used the ol' hammer-and-nail routine. It seemed to work well enough and made absolute sense to me. But some other folks went back to the root of the problem and came up with completely different solutions that worked from an oblique angle. Look at mirod's solution, for example, something I would have never even thought of. Amazing.

      The nature of Perl, I suppose...

      'kaboo

Re: Getting the number of times a regexp matches
by mirod (Canon) on Dec 07, 2000 at 14:42 UTC

    Here are some more solutions:

    $count++ while ($str=~ /$pattern/g); # simple

    or

    $count= (scalar split /$pattern/, $str )
            + ($str=~/$pattern$/)   # or it will not be counted
            - 1;                    # so it's simpler
      The split solution is not correct, because it does not account for multiple occurrences of $pattern at the end of the string:
      $str = 'ababbb';
      $pattern = 'b';
      $count = (scalar split /$pattern/, $str ) + ($str=~/$pattern$/) - 1;
      print "$count\n";
      2
      Fortunately, this is an easy problem to fix:
      $count = (scalar split /$pattern/, $str, -1) - 1;
      The third argument to split specifies the maximum number of pieces to split the string into. A negative number turns off the stripping of null fields from the end of the list, without limiting the number of pieces.
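
The effect of the negative limit can be seen in a small sketch that reuses the 'ababbb' example from above. The array names are only for illustration:

```perl
use strict;
use warnings;

my $str     = 'ababbb';
my $pattern = 'b';

# Without a limit, split drops trailing empty fields: ('a', 'a').
my @stripped = split /$pattern/, $str;

# With a negative limit they are kept: ('a', 'a', '', '', '').
my @kept = split /$pattern/, $str, -1;

# N matches cut the string into N + 1 pieces, so subtract one.
my $count = @kept - 1;

# Cross-check against the list-assignment count.
my $check = () = $str =~ /$pattern/g;

print scalar(@stripped), " fields without the limit\n";
print scalar(@kept), " fields with a limit of -1\n";
print "count via split: $count, via m//g: $check\n";
```

This relies on the pattern having no capturing groups (split returns captured text as extra fields) and on $str being non-empty (splitting an empty string yields no fields, which would give a count of -1).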
      I don't think that the second one will be correct for strings like "blue blue blue and blue again" and patterns like "^blue"...

      Update: Sorry mirod, I misread that one :)

      Let's all have a try at it:
      @_=($str =~ m/$pattern/g) and $count=1+$#_;


      /brother t0mas

        Actually it is, $count gets set to 1

Re: Getting the number of times a regexp matches
by quidity (Pilgrim) on Dec 07, 2000 at 17:59 UTC

    Everyone so far seems to have missed this bit of evil context bashing:

    $num_matches = () = $string =~ m/pattern/g;

    This works because the () forces the far right-hand side to be evaluated in list context, the result of which is then reevaluated in scalar context to give the count. This is a nice example of using side effects to good effect in Perl.

      Right answer, but I don't approve of your explanation. The reason that you get a count is because a list assignment in a scalar context returns the number of elements on its right-hand side.

      There are a ton of other "operations that would return a list if used in a list context" [often sloppily (: referred to simply as "lists" ] that would return different information if used in a scalar context.
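
A small sketch of the rule stated above, with made-up values: a list assignment evaluated in scalar context yields the number of elements on its right-hand side, no matter how many variables sit on the left. For contrast, two other list-like constructs give different information in scalar context:

```perl
use strict;
use warnings;

# List assignment in scalar context yields the number of right-hand elements,
# regardless of how many variables are on the left.
my $n_empty = () = (10, 20, 30);
my $n_one   = (my ($first) = (10, 20, 30));   # $first receives 10

# Other list-like constructs behave differently in scalar context.
my @array  = (10, 20, 30);
my $length = @array;                  # an array gives its element count
my $text   = reverse('ab', 'cd');     # reverse joins its arguments, then reverses the characters

print "$n_empty $n_one $first $length $text\n";   # prints "3 3 10 3 dcba"
```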

              - tye (but my friends call me "Tye")
      This is an interesting construct, but I don't understand why Perl permits a constant, in this case, the empty list (), to be used in the LHS expression?

      For example, the following code spits out an error (Can't modify constant item in list assignment):

      (1) = (1,2,3);
      so why is the following legal:
      () = (1,2,3);
      Why is empty list not treated as a constant?

        First, because (1) contains a constant (the "1" part) and () contains no constants. Since no constants are being modified, why give an error complaining that you are trying to modify no constants?

        Second, as Perl is implemented, I detect a clear preference toward not disallowing things even if the implementor can't think of a good use for that thing at the time. This makes sense for a TIMTOWTDI language.

        Third, we've just demonstrated a use for it. So it is a good thing it wasn't disallowed just because the use wasn't obvious at the time.

        I suspect that this working was at least partially an accident. The list assignment code was written and tested and it worked. I doubt anyone tested this degenerate list assignment. In fact, searching the standard Perl test suite, I find that this feature is not tested but it is used when testing another feature:

        # Should use magical autoinc only when both are strings
        print "not " unless 0 == (() = "0"..-1);
        print "ok 14\n";
        for my $x ("0"..-1) {
            print "not ";
        }
        print "ok 15\n";
        So there! q-:

                - tye (but my friends call me "Tye")
        I suspect it has something to do with the way these things do work:
        ($a, $b, $c) = (1, 2, 3);
        ($a, $b) = (1, 2, 3);                # 3 discarded
        ($a, undef, $c) = (1, 2, 3);         # 2 discarded
        ($a, undef) = (1, 2, 3);             # 2 and 3 discarded
        (undef, undef, undef) = (1, 2, 3);   # 1 2 and 3 discarded
        () = (1, 2, 3);                      # functionally equivalent
        "undef" is the only real non-variable value you can use on the left-hand-side like that.
Re: Getting the number of times a regexp matches
by dchetlin (Friar) on Dec 07, 2000 at 18:33 UTC
    No one mentioned the canonical and most Perlish way to do this:

    my $count = () = $str =~ m/pattern/g;

    I suspect it offers little benefit in terms of efficiency, and can be confusing contextually. I do feel that it's the most succinct and elegant, however.

    Update: My apologies; while I was writing this and getting distracted, it was mentioned ahead of me, and an interesting discussion of it followed. Ah well.

    -dlc