Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I am hereby seeking your expertise on the following problem:
I have this snippet of code:
$count=0; while($_=~/(M+)/g) { $count++; } print $count."\n";

which reads lines like the following:
$_='iiiiiiiiMMMMMMMMMMMooooooooooooMMMMMMMMMMiiiiiMMMMMMMMoooo';

and counts (obviously) the number of occurrences of M+ (in this example it's 3). Is there any faster way of writing the code I gave you? I am asking this because this code will be used in reading files with maybe even 80.000 lines like this and if I could find a faster way I could save up some time.
Thank you in advance!

Replies are listed 'Best First'.
Re: Can you write a faster code to perform this task?
by BrowserUk (Patriarch) on Sep 28, 2014 at 23:06 UTC

    Running this slightly modified version of your code:

    #! perl -slw use strict; use Time::HiRes qw[ time ]; my $start = time; my $count=0; while( <DATA> ) { while( /(M+)/g ) { $count++; } } print $count."\n"; printf "Took %9f secs\n", time() - $start; __DATA__ iiiiiiiiMMMMMMMMMMMooooooooooooMMMMMMMMMMiiiiiMMMMMMMMoooo iiiiiiiiMMMMMMMMMMMooooooooooooMMMMMMMMMMiiiiiMMMMMMMMoooo ... 79998 more lines as above ...

    Resulted in this output:

    C:\test>junk 240000 Took 0.199589 secs

    0.2 of a second to count the quarter million occurrences of your search term in 80,000 lines doesn't seem excessive to me. How much time are you looking to save?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Can you write a faster code to perform this task?
by AnomalousMonk (Archbishop) on Sep 28, 2014 at 23:03 UTC

    You'll have to Benchmark it yourself to see if it's any faster, but maybe:

    c:\@Work\Perl\monks>perl -wMstrict -le "$_ = 'iiiiiiiiMMMMMMMMMMMooooooooooooMMMMMMMMMMiiiiiMMMMMMMMoooo'; my $count =()= $_ =~ m{ M+ }xmsg; print $count; " 3

Re: Can you write a faster code to perform this task?
by davido (Cardinal) on Sep 29, 2014 at 01:48 UTC

    How complex does your pattern need to become? In the case of a pattern that consists of (M+), it's unnecessary to rev up the regexp engine.

    General solutions work pretty well for the general case. Optimizations work by knowing more about the specific case, possibly allowing you to design the code with shortcuts, bypassing work that the general case might need, but that are unnecessary in the specific case. The more we know about the specifics, the better we can do at figuring out what shortcuts can be applied here.


    Dave

Re: Can you write a faster code to perform this task?
by Anonymous Monk on Sep 28, 2014 at 23:14 UTC

    Benchmarked a few solutions:

    use Benchmark qw(:all); my $str = 'iiiiiiiiMMMMMMMMMMMooooooooooooMMMMMMMMMMiiiiiMMMMMMMMoooo' +; my $these = { 'count' => sub { my $count = 0; $count++ while $str =~ /(M+)/g; $count; }, 'regex' => sub { my $count = () = $str =~ /(M+)/g; }, 'split' => sub { $#{[split /M+/, $str, -1]}; }, }; printf "%-10s %d\n", $_, $these->{$_}->() for sort keys %$these; print "\n"; cmpthese(-1, $these);

    Got these:

    count 3 regex 3 split 3 Rate split regex count split 335814/s -- -3% -30% regex 345739/s 3% -- -27% count 476625/s 42% 38% --
Re: Can you write a faster code to perform this task?
by Anonymous Monk on Sep 28, 2014 at 23:09 UTC

    I just generated a file like the line you showed containing 80000 80-char lines; your code processed it in 0.3s. Even when I increased the line length to 1000 and the number of lines to 100k, the execution time stayed under a second. What kind of a speed gain are you looking for? Is the input and code you showed perhaps a simplified version of what you're actually doing? These kind of details do matter in optimizations.

Re: Can you write a faster code to perform this task?
by Laurent_R (Canon) on Sep 29, 2014 at 06:13 UTC
    I can't benchmark right now, but I would think that the tr/// operator is likely to be faster than a regex. Having said that, 80,000 lines is not a very large input, I am not sure that there is really a need to optimize further what you have and you're going, at best, to shave off a split second.
      tr counts single letters while the OP wants to count sequences of identical letters.

      Cheers Rolf

      (addicted to the Perl Programming Language and ☆☆☆☆ :)

        What about something like this?
        use 5.014; say 'iiiiiiiiMMMMMMMMMMMooooooooooooMMMMMMMMMMiiiiiMMMMMMMMoooo' =~ tr/M/~/sr=~tr/~//; # => prints: 3
        Update: just to be sure, it would be safer to replace 'M' with something much exotic, like '\0', which should be OK, assuming that you're reading text files. (tr/M/\0/sr =~ tr/\0//)
        True, Rolf, I mis-read the OP's requirement. I thought he wanted the number of "M". But, OTOH, trizen has found a way to use tr/// which seems to be very fast.
Re: Can you write a faster code to perform this task?
by karlgoethebier (Abbot) on Sep 29, 2014 at 13:32 UTC

    Not the fastest but nice: scalar map { /(M+)/g } ($str);

    Compared to the anon monk's benchmark below:

    count 3 map 3 regex 3 split 3 Rate split regex map count split 207677/s -- -7% -28% -47% regex 223050/s 7% -- -23% -43% map 288533/s 39% 29% -- -27% count 394162/s 90% 77% 37% --

    Regards, Karl

    «The Crux of the Biscuit is the Apostrophe»