in reply to speeding up regex

You didn't post any code, so it's very hard to say. Perhaps you are already using the fastest solution, which only leaves the answer: upgrade your hardware.

And perhaps you are doing something stupid. Then you don't need any clever tricks, you just have to get rid of the stupidity.

And who knows, perhaps you don't need a regexp at all. Maybe you can put the 300 keywords into a hash, extract all the words from the document, and look each one up for a match. Whether that's possible will depend on what the keywords are.
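A minimal sketch of the hash approach, with made-up keywords and text, assuming the keywords are single words:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up stand-ins for the 300 keywords.
my %keywords = map { $_ => 1 } qw(gene protein enzyme);

my $text = "The gene encodes a protein; the protein folds.";

my %hits;
for my $word ($text =~ /(\w+)/g) {       # pull out each word once
    $hits{$word}++ if $keywords{$word};  # O(1) hash lookup per word
}

print "$_=$hits{$_}\n" for sort keys %hits;   # gene=1, protein=2
```

One pass over the document replaces 300 separate regex passes, which is the main saving.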

Also, is there a way of measuring the execution time of perl scripts without letting the script complete its run?
No, because if the answer were 'yes', one could solve the halting problem - which is unsolvable.

Abigail

Replies are listed 'Best First'.
Re: Re: speeding up regex
by dannoura (Pilgrim) on Jul 18, 2003 at 09:50 UTC

    Here's the subroutine that's doing all the work.

    sub genes {
        my ($text, $score, @genes) = @_;
        my $genestr = "";
        my $count   = 0;
        my $total   = 0;
        foreach my $gene (@genes) {
            $count++ while $text =~ /$gene/g;   # Count number of instances
            if ($count) { $genestr .= "$gene "; }
            $total += $count * $score;
            $count  = 0;
        }
        return $total, $genestr;
    }

      You could try replacing

      $count++ while $text =~ /$gene/g; # Count number of instances
      with
      $count = () = $text =~ /$gene/g;

      It may run a little quicker. You could also try

      my $p=0; ++$count while $p = 1+index( $text, $gene, $p );

      which may be quicker still.
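As a sanity check that the three idioms agree (and as a harness you could hand to Benchmark's cmpthese for timings), here is a small sketch; the test string and pattern are made up:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up data: 1000 repetitions, three "foo"s each.
my $text = "foo bar foo baz foo " x 1000;
my $gene = "foo";

my $c1 = 0;
$c1++ while $text =~ /$gene/g;                  # original while-loop

my $c2 = () = $text =~ /$gene/g;                # count via list assignment

my ($c3, $p) = (0, 0);
++$c3 while $p = 1 + index $text, $gene, $p;    # index(), no regex at all

print "$c1 $c2 $c3\n";   # 3000 3000 3000
```

All three count non-overlapping substring occurrences, so they are interchangeable for this code.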

      If your process takes a long time to run, the obvious way to avoid waiting for the whole thing to complete before you can get a feel for which is quickest is to use a small subset of the data whilst testing. If the text being searched comes from a file, try using head to grab the first couple of hundred lines of the real data, and use that for performance testing the options.



      Well, that would also mean that if "men" is among the words to match, you'd count it twice if a word like "amendment" appears in the text. Is that what you want? Your description talks about words, but your code just matches any (non-overlapping) substrings.

      Abigail

      Try replacing

      $count++ while $text =~ /$gene/g; # Count number of instances

      with

      my $patn = qr/\b$gene\b/;
      $count++ while $text =~ /$patn/g;
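To see what the \b changes, a tiny made-up example: plain /men/ also counts substrings inside other words, while /\bmen\b/ counts only the whole word:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "men mend the amendment";

my $plain = () = $text =~ /men/g;      # also hits "mend" and "amendment" (twice)
my $bound = () = $text =~ /\bmen\b/g;  # whole-word matches only

print "$plain $bound\n";   # 4 1
```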

        Compiling keywords can make a difference if you do all of them at once, before the loop.

        #!/usr/bin/perl -w
        use strict;

        open WORDS, "<kwords" or die;
        my %kwords = ();
        while (<WORDS>) {
            chomp;
            $kwords{$_} = qr/\b$_\b/m;
        }
        close WORDS;

        my %found = ();
        for my $f (<abstract*>) {
            local $/;
            open FILE, $f or die "$f\n";
            my $text = <FILE>;
            close FILE;
            for (keys %kwords) {
                my $val = $kwords{$_};
                $found{$f} .= "$_ " if $text =~ /$val/;
            }
        }
        print "$_\t$found{$_}\n" for sort keys %found;

        Assuming that the keywords are in a file, and each abstract is in a separate file, precompilation makes the search 30% faster (using 1000 test files, 300 words each, 3 random keywords in 2/3 of them).
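My reading of why it helps here, rather than for any single pattern: with plain string interpolation the pattern variable changes on every trip through the keyword loop, so perl recompiles the regex each time, whereas a qr// object is compiled once up front and reused across all the files. A minimal sketch with made-up keyword names:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up stand-ins for the keyword list.
my @words = map { "kw$_" } 1 .. 300;

# Compile each \b...\b pattern exactly once, before any searching.
my %compiled = map { $_ => qr/\b$_\b/ } @words;

my $text  = "some text mentioning kw7 and kw42";
my @found = grep { $text =~ $compiled{$_} } @words;

print "@found\n";   # kw7 kw42
```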

        Eh, could you give an example string and pattern where the compilation makes a difference? I've tried several patterns and strings, but Benchmark never shows a difference that's more than 1%.

        The \b could make a difference, but it's so far unclear whether a \b is justified or not.

        Abigail