Regular expression speed issues

dannoura has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I tried using the different methods proposed in this thread to speed up my regex. There are basically three methods (the code is posted below) and I timed them with dprof. The result is

Total Elapsed Time = 0.719999 Seconds
  User+System Time = 0.186597 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 17.1   0.032  0.032   2335   0.0000 0.0000  main::method2
 16.6   0.031  0.031   2335   0.0000 0.0000  main::method3
 8.57   0.016  0.016      2   0.0080 0.0080  main::BEGIN
 0.00   0.000 -0.000      1   0.0000      -  strict::import
 0.00   0.000 -0.000      1   0.0000      -  strict::bits
 0.00   0.000 -0.000      1   0.0000      -  warnings::BEGIN
 0.00   0.000 -0.000      1   0.0000      -  Exporter::import
 0.00   0.000 -0.000      1   0.0000      -  warnings::import
 0.00   0.000 -0.000   2335   0.0000      -  main::method1
[download]

The question is why is method1 so much more faster than the others? I'm guessing the people who replied to my post are all experienced programmers so I can see no reason why one method would be so much faster than the other. Does anyone have any ideas about this?

#! c:\perl\bin
use strict;
use warnings;

my $text="";
my @abstracts=();

open(FH, "pc clean25.txt") or die ("can't");
$text=join("",<FH>);
(@abstracts)=($text=~/<ABSTRACT>(.*?)<\/ABSTRACT>/sg);
foreach my $abstract (@abstracts) {
print "method 1: ",method1($abstract, "prostate"),"method 1: ",method2
+($abstract, "prostate"),"method 3: ",method3($abstract, "prostate");
}
sub method1 {
my $text=shift;
my $gene=shift;
my $count = () = $text =~ /$gene/g;

return $count;
}

sub method2 {
my $text=shift;
my $gene=shift;
my $count=0;

my $p=0; 
++$count while $p = 1+index( $text, $gene, $p );

return $count;
}

sub method3 {
my $text=shift;
my $gene=shift;
my $count=0;

my $patn = qr/\b$gene\b/;
$count++ while $text =~ /$patn/g; 

return $count
}
[download]

And the file simply contains a lot of abstracts.

Comment on Regular expression speed issues Select or Download Code

Replies are listed 'Best First'.
Re: Regular expression speed issues by Tomte (Priest) on Jul 22, 2003 at 10:04 UTC
For the record: I think Abigail-II is right: `Benchmark: timing 1000000 iterations of method1, method2, method3... method1: 71 wallclock secs (69.22 usr + 0.13 sys = 69.35 CPU) @ 14 +419.61/s (n=1000000) method2: 39 wallclock secs (39.21 usr + 0.00 sys = 39.21 CPU) @ 25 +503.70/s (n=1000000) method3: 100 wallclock secs (96.76 usr + 0.13 sys = 96.89 CPU) @ 1 +0320.98/s (n=1000000)` [download] I'm mesuring the execution of a for-loop, that gives inacurate results of the absolute time each method takes, but difference in times is well explainable: method1 uses the regex-engine to loop and count the matches. The engine isn't cheap to use, but faster than method3, that builds this loop in perl. method2 is fastest, because searching a fixed string using `index` without the overhead (in this case) of the regexp-engine is faster, as Abigail-II explained. `method3` is slowest, because it uses the regexp-engine as does method1, but builds the loop-construct to count matches within perl-code, requiring more overhead in perl to manage the loop The Benchmark-code: Read more... (1184 Bytes) regards, tomte edit:: added interpretation of results Hlade's Law: If you have a difficult task, give it to a lazy person -- they will find an easier way to do it.	[reply] [d/l] [select]
Re: Regular expression speed issues by gjb (Vicar) on Jul 22, 2003 at 08:36 UTC
I'm not sure your benchmark is sound since the execution times are really low. I'd be surprised if it is reliable. You might want to have a look at the Benchmark module that comes with the standard distribution, it figures out how many times to run the code to get a reliable benchmark automatically. Although benchmarking is non-trivial, I think you'll have better luck with that approach. Just my 2 cents, -gjb-	[reply]
Re: Regular expression speed issues by Abigail-II (Bishop) on Jul 22, 2003 at 07:54 UTC
`meth1` and `meth3` use regular expressions. `meth2` uses index(). Finding fixed strings is much easier than finding patterns. You seem to think `meth2` shouldn't be much faster than the other methods. I wonder why. Abigail	[reply]
Re: Re: Regular expression speed issues by TimToady (Parson) on Jul 22, 2003 at 16:08 UTC
But note that the presence of utf-8 can slow down functions such as index() and substr() that rely on offsets. It would be possible to get around that by caching the last known offset in a string, but as far as I know, that optimization hasn't been done yet.	[reply]
Re: Re: Regular expression speed issues by dannoura (Pilgrim) on Jul 22, 2003 at 08:02 UTC
Method1 is the fastest (by far) and it uses a regular expression.	[reply]
Re: Regular expression speed issues by TomDLux (Vicar) on Jul 22, 2003 at 08:09 UTC
If you do DEPARSE or profile the number of calls, you'll find `method1` is invoked once. Since it is in a list context, the one call returns the required results. Marshalling arguments into and out of routines is expensive. -- `TTTATCGGTCGTTATATAGATGTTTGCA`	[reply]
Re: Re: Regular expression speed issues by dannoura (Pilgrim) on Jul 22, 2003 at 08:28 UTC
I didn't quite understand that. According to the profiler Method1 is invoked 2335 times, same as the other ones. Also, it seems the profiler is unreliable since the run times fluctuate wildly when I run the program repeatedly. Now I get: Total Elapsed Time = 0.150962 Seconds User+System Time = 0.149962 Seconds Exclusive Times %Time ExclSec CumulS #Calls sec/call Csec/c Name 20.0 0.030 0.028 2335 0.0000 0.0000 main::method1 10.6 0.016 0.014 2335 0.0000 0.0000 main::method3 10.6 0.016 0.016 1 0.0160 0.0160 warnings::BEGIN 0.00 0.000 0.016 2 0.0000 0.0080 main::BEGIN 0.00 0.000 -0.000 1 0.0000 - strict::import 0.00 0.000 -0.000 1 0.0000 - strict::bits 0.00 0.000 -0.000 1 0.0000 - Exporter::import 0.00 0.000 -0.000 1 0.0000 - warnings::import 0.00 0.000 -0.002 2335 0.0000 - main::method2 [download]	[reply] [d/l]