dannoura has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have a script which looks for instances of about 300 keywords in about 6,000 article abstracts. Needless to say, it's very slow. Which brings me to my question: what (if any) are the standard ways of speeding up processes like these?

Also, is there a way of measuring the execution time of perl scripts without letting the script complete its run?

Re: speeding up regex
by Abigail-II (Bishop) on Jul 18, 2003 at 09:27 UTC
    You didn't post any code, so it's very hard to say. Perhaps you are already using the fastest solution, which only leaves the answer: upgrade your hardware.

    And perhaps you are doing something stupid. Then you don't need any clever tricks, you just have to get rid of the stupidity.

    And who knows, perhaps you don't need a regexp at all. Maybe you can put the 300 keywords into a hash, extract all the words from each document, and look each one up. Whether that is possible will depend on what the keywords are.
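
    For instance, something along these lines might do. This is only an untested sketch: the keyword list and abstract are made up, and it assumes the keywords are single words, so splitting on non-word characters is good enough.

    use strict;
    use warnings;

    # Made-up keyword list and abstract, standing in for the real data
    my @keywords = qw(kinase receptor promoter);
    my $abstract = "The receptor binds upstream of the promoter; the receptor is a kinase.";

    # One hash lookup per word instead of 300 regex passes per abstract
    my %is_keyword = map { lc($_) => 1 } @keywords;

    my %count;
    for my $word ( split /\W+/, lc $abstract ) {
        $count{$word}++ if $is_keyword{$word};
    }

    print "$_ => $count{$_}\n" for sort keys %count;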

    Also, is there a way of measuring the execution time of perl scripts without letting the script complete its run?
    No, because if the answer were 'yes', one could solve the halting problem, which is unsolvable.

    Abigail

      Here's the subroutine that's doing all the work.

      sub genes {
          my ($text, $score, @genes) = @_;
          my $genestr = "";
          my $count   = 0;
          my $total   = 0;
          foreach my $gene (@genes) {
              $count++ while $text =~ /$gene/g;    # Count number of instances
              if ($count) { $genestr .= "$gene "; }
              $total += $count * $score;
              $count = 0;
          }
          return $total, $genestr;
      }

        You could try replacing

        $count++ while $text =~ /$gene/g; # Count number of instances
        with
        $count = () = $text =~ /$gene/g;

        It may run a little quicker. You could also try

        my $p=0; ++$count while $p = 1+index( $text, $gene, $p );

        Which may be quicker still.

        If your process takes a long time to run, the obvious way to get a feel for which option is quickest, without waiting for the whole thing to complete, is to use a small subset of the data whilst testing. If the text being searched comes from a file, try using head to grab the first couple of hundred lines of the real data and use that for performance-testing the options.
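
        The three variants could also be compared directly with the core Benchmark module. This is just a rough sketch, with a made-up string and keyword standing in for your real data:

        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        # Made-up stand-ins for a real abstract and keyword
        my $text = "the quick brown fox jumps over the lazy dog " x 200;
        my $gene = "fox";

        cmpthese( -3, {    # run each variant for roughly 3 CPU seconds
            while_match => sub {
                my $count = 0;
                $count++ while $text =~ /$gene/g;
            },
            list_assign => sub {
                my $count = () = $text =~ /$gene/g;
            },
            index_loop  => sub {
                my ( $count, $p ) = ( 0, 0 );
                ++$count while $p = 1 + index( $text, $gene, $p );
            },
        } );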


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

        Well, that would also mean that if men is among the words to match, you'd count it twice if a word like amendment appears in the text. Is that what you want? Your description talks about words, but your code just matches any (non-overlapping) substrings.

        Abigail

        Try replacing

        $count++ while $text =~ /$gene/g; # Count number of instances

        with

        my $patn = qr/\b$gene\b/;
        $count++ while $text =~ /$patn/g;
Re: speeding up regex
by tilly (Archbishop) on Jul 18, 2003 at 17:19 UTC
    You might find some good ideas in RE (tilly) 4: SAS log scanner.

    As for measuring execution time without letting the script finish, a common approach is to have the script print out time elapsed at regular milestones. If you wish to be fancy, you can use Time::HiRes to print fractions of a second.

    Of course you can't get very good information this way, but you can get a vague idea of how it is doing before it completes.
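
    A rough sketch of that idea follows; the milestone labels are invented, and in the real script you'd put the calls wherever makes sense:

    use strict;
    use warnings;
    use Time::HiRes qw(time);    # floating-point time()

    my $start = time;
    my $last  = $start;

    # Print time since the previous milestone and since the start
    sub milestone {
        my ($label) = @_;
        my $now = time;
        printf "%-25s +%.3fs (%.3fs total)\n", $label, $now - $last, $now - $start;
        $last = $now;
    }

    milestone('loaded abstracts');   # e.g. after reading the 6,000 abstracts
    milestone('scanned keywords');   # e.g. after the keyword pass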

Re: speeding up regex
by ajdelore (Pilgrim) on Jul 18, 2003 at 16:20 UTC

    Depending on your exact needs, you may want to consider building an index to speed up this kind of search.

    Once you have an index (basically a list of words), you can start trying different search algorithms to query the index. YMMV, of course.
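
    As a rough sketch of the idea (the abstracts and keywords here are invented), an inverted index might look something like this:

    use strict;
    use warnings;

    # Made-up abstracts keyed by id
    my %abstract = (
        1 => "The receptor binds the promoter region.",
        2 => "Nothing of interest in this one.",
    );

    # Build the index once: word => { abstract id => number of occurrences }
    my %index;
    while ( my ( $id, $text ) = each %abstract ) {
        $index{$_}{$id}++ for split /\W+/, lc $text;
    }

    # A query is then just a couple of hash lookups per keyword
    for my $kw (qw(receptor promoter)) {
        my $hits = $index{ lc $kw } or next;
        for my $id ( sort keys %$hits ) {
            print "'$kw' found $hits->{$id} time(s) in abstract $id\n";
        }
    }

    Building the index costs one pass over the abstracts; after that, each keyword lookup is a hash access rather than a regex scan over all 6,000 texts.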

    </ajdelore>