dannoura has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I tried using the different methods proposed in this thread to speed up my regex. There are basically three methods (the code is posted below) and I timed them with dprof. The result is

Total Elapsed Time = 0.719999 Seconds User+System Time = 0.186597 Seconds Exclusive Times %Time ExclSec CumulS #Calls sec/call Csec/c Name 17.1 0.032 0.032 2335 0.0000 0.0000 main::method2 16.6 0.031 0.031 2335 0.0000 0.0000 main::method3 8.57 0.016 0.016 2 0.0080 0.0080 main::BEGIN 0.00 0.000 -0.000 1 0.0000 - strict::import 0.00 0.000 -0.000 1 0.0000 - strict::bits 0.00 0.000 -0.000 1 0.0000 - warnings::BEGIN 0.00 0.000 -0.000 1 0.0000 - Exporter::import 0.00 0.000 -0.000 1 0.0000 - warnings::import 0.00 0.000 -0.000 2335 0.0000 - main::method1

The question is why is method1 so much more faster than the others? I'm guessing the people who replied to my post are all experienced programmers so I can see no reason why one method would be so much faster than the other. Does anyone have any ideas about this?

#! c:\perl\bin use strict; use warnings; my $text=""; my @abstracts=(); open(FH, "pc clean25.txt") or die ("can't"); $text=join("",<FH>); (@abstracts)=($text=~/<ABSTRACT>(.*?)<\/ABSTRACT>/sg); foreach my $abstract (@abstracts) { print "method 1: ",method1($abstract, "prostate"),"method 1: ",method2 +($abstract, "prostate"),"method 3: ",method3($abstract, "prostate"); } sub method1 { my $text=shift; my $gene=shift; my $count = () = $text =~ /$gene/g; return $count; } sub method2 { my $text=shift; my $gene=shift; my $count=0; my $p=0; ++$count while $p = 1+index( $text, $gene, $p ); return $count; } sub method3 { my $text=shift; my $gene=shift; my $count=0; my $patn = qr/\b$gene\b/; $count++ while $text =~ /$patn/g; return $count }

And the file simply contains a lot of abstracts.

Replies are listed 'Best First'.
Re: Regular expression speed issues
by Tomte (Priest) on Jul 22, 2003 at 10:04 UTC
    For the record:

    I think Abigail-II is right:

    Benchmark: timing 1000000 iterations of method1, method2, method3... method1: 71 wallclock secs (69.22 usr + 0.13 sys = 69.35 CPU) @ 14 +419.61/s (n=1000000) method2: 39 wallclock secs (39.21 usr + 0.00 sys = 39.21 CPU) @ 25 +503.70/s (n=1000000) method3: 100 wallclock secs (96.76 usr + 0.13 sys = 96.89 CPU) @ 1 +0320.98/s (n=1000000)
    I'm mesuring the execution of a for-loop, that gives inacurate results of the absolute time each method takes, but difference in times is well explainable:
    • method1 uses the regex-engine to loop and count the matches. The engine isn't cheap to use, but faster than method3, that builds this loop in perl.
    • method2 is fastest, because searching a fixed string using index without the overhead (in this case) of the regexp-engine is faster, as Abigail-II explained.
    • method3 is slowest, because it uses the regexp-engine as does method1, but builds the loop-construct to count matches within perl-code, requiring more overhead in perl to manage the loop

    The Benchmark-code:

    regards,
    tomte

    edit:: added interpretation of results


    Hlade's Law:

    If you have a difficult task, give it to a lazy person --
    they will find an easier way to do it.

Re: Regular expression speed issues
by gjb (Vicar) on Jul 22, 2003 at 08:36 UTC

    I'm not sure your benchmark is sound since the execution times are really low. I'd be surprised if it is reliable.

    You might want to have a look at the Benchmark module that comes with the standard distribution, it figures out how many times to run the code to get a reliable benchmark automatically. Although benchmarking is non-trivial, I think you'll have better luck with that approach.

    Just my 2 cents, -gjb-

Re: Regular expression speed issues
by Abigail-II (Bishop) on Jul 22, 2003 at 07:54 UTC
    meth1 and meth3 use regular expressions. meth2 uses index(). Finding fixed strings is much easier than finding patterns.

    You seem to think meth2 shouldn't be much faster than the other methods. I wonder why.

    Abigail

      But note that the presence of utf-8 can slow down functions such as index() and substr() that rely on offsets. It would be possible to get around that by caching the last known offset in a string, but as far as I know, that optimization hasn't been done yet.

      Method1 is the fastest (by far) and it uses a regular expression.

Re: Regular expression speed issues
by TomDLux (Vicar) on Jul 22, 2003 at 08:09 UTC

    If you do DEPARSE or profile the number of calls, you'll find method1 is invoked once. Since it is in a list context, the one call returns the required results. Marshalling arguments into and out of routines is expensive.

    --
    TTTATCGGTCGTTATATAGATGTTTGCA

      I didn't quite understand that. According to the profiler Method1 is invoked 2335 times, same as the other ones. Also, it seems the profiler is unreliable since the run times fluctuate wildly when I run the program repeatedly.

      Now I get:

      Total Elapsed Time = 0.150962 Seconds User+System Time = 0.149962 Seconds Exclusive Times %Time ExclSec CumulS #Calls sec/call Csec/c Name 20.0 0.030 0.028 2335 0.0000 0.0000 main::method1 10.6 0.016 0.014 2335 0.0000 0.0000 main::method3 10.6 0.016 0.016 1 0.0160 0.0160 warnings::BEGIN 0.00 0.000 0.016 2 0.0000 0.0080 main::BEGIN 0.00 0.000 -0.000 1 0.0000 - strict::import 0.00 0.000 -0.000 1 0.0000 - strict::bits 0.00 0.000 -0.000 1 0.0000 - Exporter::import 0.00 0.000 -0.000 1 0.0000 - warnings::import 0.00 0.000 -0.002 2335 0.0000 - main::method2