in reply to speeding up regex

You didn't post any code, so it's very hard to say. Perhaps you are already using the fastest solution, which only leaves the answer: upgrade your hardware.

And perhaps you are doing something stupid. Then you don't need any clever tricks, you just have to get rid of the stupidity.

And who knows, perhaps you don't need a regexp at all. Maybe you can put the 300 keywords into a hash, extract all the words from the document, and look each one up for a match. Whether that's possible will depend on what the keywords are.
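A minimal sketch of the hash approach, with made-up keywords and text, assuming the keywords are single words:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up stand-ins for the 300 keywords.
my %keywords = map { $_ => 1 } qw(gene protein enzyme);

my $text = "The gene encodes a protein; the protein folds.";

my %hits;
for my $word ($text =~ /(\w+)/g) {       # pull out each word once
    $hits{$word}++ if $keywords{$word};  # O(1) hash lookup per word
}

print "$_=$hits{$_}\n" for sort keys %hits;   # gene=1, protein=2
```

One pass over the document replaces 300 separate regex passes, which is the main saving.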

Also, is there a way of measuring the execution time of perl scripts without letting the script complete its run?
No, because if the answer were 'yes', one could solve the halting problem - which is unsolvable.

Abigail

Replies are listed 'Best First'.
Re: Re: speeding up regex
by dannoura (Pilgrim) on Jul 18, 2003 at 09:50 UTC

    Here's the subroutine that's doing all the work.

    sub genes {
        my ($text, $score, @genes) = @_;
        my $genestr = "";
        my $count   = 0;
        my $total   = 0;
        foreach my $gene (@genes) {
            $count++ while $text =~ /$gene/g;   # Count number of instances
            if ($count) { $genestr .= "$gene "; }
            $total += $count * $score;
            $count  = 0;
        }
        return $total, $genestr;
    }

      You could try replacing

      $count++ while $text =~ /$gene/g; # Count number of instances
      with
      $count = () = $text =~ /$gene/g;

      It may run a little quicker. You could also try

      my $p=0; ++$count while $p = 1+index( $text, $gene, $p );

      which may be quicker still.
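As a sanity check that the three idioms agree (and as a harness you could hand to Benchmark's cmpthese for timings), here is a small sketch; the test string and pattern are made up:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up data: 1000 repetitions, three "foo"s each.
my $text = "foo bar foo baz foo " x 1000;
my $gene = "foo";

my $c1 = 0;
$c1++ while $text =~ /$gene/g;                  # original while-loop

my $c2 = () = $text =~ /$gene/g;                # count via list assignment

my ($c3, $p) = (0, 0);
++$c3 while $p = 1 + index $text, $gene, $p;    # index(), no regex at all

print "$c1 $c2 $c3\n";   # 3000 3000 3000
```

All three count non-overlapping substring occurrences, so they are interchangeable for this code.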

      If your process takes a long time to run, the obvious way to avoid waiting for the whole thing to complete before you can get a feel for which is quickest is to use a small subset of the data whilst testing. If the text being searched comes from a file, try using head to grab the first couple of hundred lines of the real data, and use that for performance testing the options.



      Well, that would also mean that if "men" is among the words to match, you'd count it twice if a word like "amendment" appears in the text. Is that what you want? Your description talks about words, but your code just matches any (non-overlapping) substrings.

      Abigail

      Try replacing

      $count++ while $text =~ /$gene/g; # Count number of instances

      with

      my $patn = qr/\b$gene\b/;
      $count++ while $text =~ /$patn/g;
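To see what the \b changes, a tiny made-up example: plain /men/ also counts substrings inside other words, while /\bmen\b/ counts only the whole word:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "men mend the amendment";

my $plain = () = $text =~ /men/g;      # also hits "mend" and "amendment" (twice)
my $bound = () = $text =~ /\bmen\b/g;  # whole-word matches only

print "$plain $bound\n";   # 4 1
```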

        Compiling keywords can make a difference if you do all of them at once, before the loop.

        #!/usr/bin/perl -w
        use strict;

        open WORDS, "<kwords" or die;
        my %kwords = ();
        while (<WORDS>) {
            chomp;
            $kwords{$_} = qr/\b$_\b/m;
        }
        close WORDS;

        my %found = ();
        for my $f (<abstract*>) {
            local $/;
            open FILE, $f or die "$f\n";
            my $text = <FILE>;
            close FILE;
            for (keys %kwords) {
                my $val = $kwords{$_};
                $found{$f} .= "$_ " if $text =~ /$val/;
            }
        }
        print "$_\t$found{$_}\n" for sort keys %found;

        Assuming that the keywords are in a file, and each abstract is in a separate file, precompilation makes the search 30% faster (using 1000 test files, 300 words each, 3 random keywords in 2/3 of them).
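My reading of why it helps here, rather than for any single pattern: with plain string interpolation the pattern variable changes on every trip through the keyword loop, so perl recompiles the regex each time, whereas a qr// object is compiled once up front and reused across all the files. A minimal sketch with made-up keyword names:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up stand-ins for the keyword list.
my @words = map { "kw$_" } 1 .. 300;

# Compile each \b...\b pattern exactly once, before any searching.
my %compiled = map { $_ => qr/\b$_\b/ } @words;

my $text  = "some text mentioning kw7 and kw42";
my @found = grep { $text =~ $compiled{$_} } @words;

print "@found\n";   # kw7 kw42
```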

        Eh, could you give an example string and pattern where the compilation makes a difference? I've tried several patterns and strings, but Benchmark never shows a difference that's more than 1%.

        The \b could make a difference, but it's so far unclear whether a \b is justified or not.

        Abigail