dannoura has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have a script which looks for instances of about 300 keywords in about 6,000 article abstracts. Needless to say, it's very slow. Which brings me to my question: what (if any) are the standard ways of speeding up processes like these?

Also, is there a way of measuring the execution time of perl scripts without letting the script complete its run?

Re: speeding up regex
by Abigail-II (Bishop) on Jul 18, 2003 at 09:27 UTC
    You didn't post any code, so it's very hard to say. Perhaps you are already using the fastest solution, which only leaves the answer: upgrade your hardware.

    And perhaps you are doing something stupid. Then you don't need any clever tricks, you just have to get rid of the stupidity.

    And who knows, perhaps you don't need a regexp at all. Maybe you can put the 300 keywords into a hash, extract all the words from each document, and look each one up. Whether that is possible will depend on what the keywords are.
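
    For instance, something along these lines might do. This is only an untested sketch: the keyword list and abstract are made up, and it assumes the keywords are single words, so splitting on non-word characters is good enough.

    use strict;
    use warnings;

    # Made-up keyword list and abstract, standing in for the real data
    my @keywords = qw(kinase receptor promoter);
    my $abstract = "The receptor binds upstream of the promoter; the receptor is a kinase.";

    # One hash lookup per word instead of 300 regex passes per abstract
    my %is_keyword = map { lc($_) => 1 } @keywords;

    my %count;
    for my $word ( split /\W+/, lc $abstract ) {
        $count{$word}++ if $is_keyword{$word};
    }

    print "$_ => $count{$_}\n" for sort keys %count;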

    Also, is there a way of measuring the execution time of perl scripts without letting the script complete its run?
    No, because if the answer were 'yes', one could solve the halting problem, which is unsolvable.

    Abigail

      Here's the subroutine that's doing all the work.

      sub genes {
          my ($text, $score, @genes) = @_;
          my $genestr = "";
          my $count   = 0;
          my $total   = 0;
          foreach my $gene (@genes) {
              $count++ while $text =~ /$gene/g;    # Count number of instances
              if ($count) { $genestr .= "$gene "; }
              $total += $count * $score;
              $count = 0;
          }
          return $total, $genestr;
      }

        You could try replacing

        $count++ while $text =~ /$gene/g; # Count number of instances
        with
        $count = () = $text =~ /$gene/g;

        It may run a little quicker. You could also try

        my $p=0; ++$count while $p = 1+index( $text, $gene, $p );

        Which may be quicker still.

        If your process takes a long time to run, the obvious way to get a feel for which option is quickest, without waiting for the whole thing to complete, is to use a small subset of the data whilst testing. If the text being searched comes from a file, try using head to grab the first couple of hundred lines of the real data and use that for performance-testing the options.
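
        The three variants could also be compared directly with the core Benchmark module. This is just a rough sketch, with a made-up string and keyword standing in for your real data:

        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        # Made-up stand-ins for a real abstract and keyword
        my $text = "the quick brown fox jumps over the lazy dog " x 200;
        my $gene = "fox";

        cmpthese( -3, {    # run each variant for roughly 3 CPU seconds
            while_match => sub {
                my $count = 0;
                $count++ while $text =~ /$gene/g;
            },
            list_assign => sub {
                my $count = () = $text =~ /$gene/g;
            },
            index_loop  => sub {
                my ( $count, $p ) = ( 0, 0 );
                ++$count while $p = 1 + index( $text, $gene, $p );
            },
        } );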


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

        Well, that would also mean that if men is among the words to match, you'd count it twice if a word like amendment appears in the text. Is that what you want? Your description talks about words, but your code just matches any (non-overlapping) substrings.

        Abigail

        Try replacing

        $count++ while $text =~ /$gene/g; # Count number of instances

        with

        my $patn = qr/\b$gene\b/;
        $count++ while $text =~ /$patn/g;
Re: speeding up regex
by tilly (Archbishop) on Jul 18, 2003 at 17:19 UTC
    You might find some good ideas in RE (tilly) 4: SAS log scanner.

    As for measuring execution time without letting the script finish, a common approach is to have the script print out time elapsed at regular milestones. If you wish to be fancy, you can use Time::HiRes to print fractions of a second.

    Of course you can't get very good information this way, but you can get a vague idea of how it is doing before it completes.
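
    A rough sketch of that idea follows; the milestone labels are invented, and in the real script you'd put the calls wherever makes sense:

    use strict;
    use warnings;
    use Time::HiRes qw(time);    # floating-point time()

    my $start = time;
    my $last  = $start;

    # Print time since the previous milestone and since the start
    sub milestone {
        my ($label) = @_;
        my $now = time;
        printf "%-25s +%.3fs (%.3fs total)\n", $label, $now - $last, $now - $start;
        $last = $now;
    }

    milestone('loaded abstracts');   # e.g. after reading the 6,000 abstracts
    milestone('scanned keywords');   # e.g. after the keyword pass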

Re: speeding up regex
by ajdelore (Pilgrim) on Jul 18, 2003 at 16:20 UTC

    Depending on your exact needs, you may want to consider building an index to speed up this kind of search.

    Once you have an index (basically a list of words), you can start trying different search algorithms to query the index. YMMV, of course.
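
    As a rough sketch of the idea (the abstracts and keywords here are invented), an inverted index might look something like this:

    use strict;
    use warnings;

    # Made-up abstracts keyed by id
    my %abstract = (
        1 => "The receptor binds the promoter region.",
        2 => "Nothing of interest in this one.",
    );

    # Build the index once: word => { abstract id => number of occurrences }
    my %index;
    while ( my ( $id, $text ) = each %abstract ) {
        $index{$_}{$id}++ for split /\W+/, lc $text;
    }

    # A query is then just a couple of hash lookups per keyword
    for my $kw (qw(receptor promoter)) {
        my $hits = $index{ lc $kw } or next;
        for my $id ( sort keys %$hits ) {
            print "'$kw' found $hits->{$id} time(s) in abstract $id\n";
        }
    }

    Building the index costs one pass over the abstracts; after that, each keyword lookup is a hash access rather than a regex scan over all 6,000 texts.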

    </ajdelore>