sinlam has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a performance issue to use Perl regular expression to tag protein names in list of sentences. I have 7 million protein names. Each of the 7 million protein names may contain multiple words, and I would like to tag protein names in a sentence which has perfect match in one of the 7 million names.

The protein names are sorted to match the longest protein name first and I store all protein names into one big string separated by delimiter '|'. The code is shown below:

$sortedproteinnamesstring = join '|', reverse sort map { quotemeta } @ +proteinnames; $line =~ s/($sortedproteinnamesstring)/\*\*$1\*\*/ig;
Each sentence is read and put in a variable '$line'. The regular expression on the 2nd line in the code will take from 0.2 sec to 9.5 sec to process each sentence, varies for each sentence. If I have many sentences to tag, it will take many minutes. I would like to know if there is any alternative to speed up the regular expression searching? Thanks for any suggestion!

Replies are listed 'Best First'.
Re: Tag protein names in sentences
by BrowserUk (Patriarch) on Feb 12, 2010 at 19:41 UTC

    Are you using perl v5.10.0 or later?

    I suspect that you may be falling foul of a pathological case of the regex Trie optimisation. See 5.10.0 regex slowdown for some discussion.

    You might find your performance improves if you include:

    BEGIN{ ${^RE_TRIE_MAXBUF} = 0; }

    at the top of your program.

    Note: That's just one possibility that I draw from the referenced thread. Things may have move on since then.

    Another possibility would be to break your 7 million alternations regex up into smaller ones. For example, you could group them into subsets keyed by the first "word", and build an alternation regex from each group. You'd then interate over the hash of re's and only invoke the full sub-re on the sentence if it contains the first word.

    Something like:

    @proteinNames = ...; my %res; m[^(\S+)] and push @{ $res{ $1 } }, $_ for @proteinNames; $res{ $_ } = join '|', reverse sort map { quotemeta } @{ $res{ $_ } } for keys %res; while( my $line = <> ) { for my $key ( keys %res ) { if( $line =~ $key ) { $line =~ s[($re{ $key })][\*\*$1\*\*]ig; print $line; } } }

    Whether that would help or hinder is impossible to say without testing it on real data.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Tag protein names in sentences
by BioLion (Curate) on Feb 12, 2010 at 19:04 UTC

    Searching over and over is going to take forever! You would do better putting your time into computing a lookup table of the names, then breaking your sentences down into tokens (words) and checking for their existence in the lookup table. You could use either a hash, if you have the memory, or a database.

    Have a look around, maybe time spent now will save you time in the long run, but if you are only doing this search the once, maybe just take the hit of long computation time, and get onto the next problem.

    Sorry I can't be of more help, but I am going home!

    Just a something something...
Re: Tag protein names in sentences
by zwon (Abbot) on Feb 12, 2010 at 18:57 UTC
    It would help if you show us sample data. Could you post example of @proteinnam and one (or more) of these lines on which it takes 2 seconds to match?
Re: Tag protein names in sentences
by sinlam (Novice) on Feb 12, 2010 at 19:56 UTC
    Sample protein names: 7-phospho-2-dehydro-3-deoxy-D-arabino-heptonate D-erythrose-4-phosphat +e lyase (pyruvate-phosphorylating) gamma-glutamyl-gamma-aminobutyraldehyde dehydrogenase (Gamma-Glu-gamma +-aminobutyraldehyde dehydrogenase) 3-phosphoshikimate 1-carboxyvinyltransferase (5-enolpyruvylshikimate-3 +-phosphate synthase)(EPSP synthase Hypothetical protein CBG17340 gonadotrophin alpha 2 subunit Doc4 protein , stress-induced optomotor-blind Dfrizzled-3 Tramtrack69 gutfeeling betaFTZ-F1 Sex-lethal Strabismus PAR3alpha Armadillo AP-2alpha GLUT1CBP eIF3-p44 Flamingo PP2Czeta PLCgamma TFIIIC90 Wingless Frizzled Profilin TXBP151
    Sample sentences to tag on their protein names: The present p65(91) (arf)71 results show that the rate of Ca(2+)-induc +ed structural transition and Ca(2+) sensitivity of the inhibitory reg +ion of cTnI were modified by ( 1 ) thin filament formation , ( 2 ) th +e presence of strongly bound S1 , and ( 3 ) PKA phosphorylation of th +e N-terminus of cTnI
    The searching for the above sentence takes about 9 sec (not 2 sec as I mentioned earlier as I was using smaller number of protein names) when using 7 million protein names.

      I'd build a lookup table then walk over the sentence one word at a time looking for matches:

      use strict; use warnings; my $sentence = 'The Doc4 protein , stress-induced flibbled the woozle +in the presence of gonadotrophin alpha 2 subunit'; my %proteinLU; while (<DATA>) { chomp; my $protein = $_; my @parts = split; my $parent = \%proteinLU; while (@parts) { my $part = shift @parts; $parent = $parent->{$part} ||= {}; next if @parts; $parent->{_name_} = $protein; } } my @words = split ' ', $sentence; while (@words) { my $word = shift @words; next if ! exists $proteinLU{$word}; my $parent = $proteinLU{$word}; my $wIndex = 0; while ($wIndex < @words && exists $parent->{$words[$wIndex]}) { $parent = $parent->{$words[$wIndex++]} } print "$parent->{_name_}\n" if exists $parent->{_name_}; } __DATA__ 7-phospho-2-dehydro-3-deoxy-D-arabino-heptonate D-erythrose-4-phosphat +e lyase (pyruvate-phosphorylating) gamma-glutamyl-gamma-aminobutyraldehyde dehydrogenase (Gamma-Glu-gamma +-aminobutyraldehyde dehydrogenase) 3-phosphoshikimate 1-carboxyvinyltransferase (5-enolpyruvylshikimate-3 +-phosphate synthase)(EPSP synthase Hypothetical protein CBG17340 gonadotrophin alpha 2 subunit Doc4 protein , stress-induced optomotor-blind Dfrizzled-3 Tramtrack69 gutfeeling betaFTZ-F1 Sex-lethal Strabismus PAR3alpha Armadillo AP-2alpha GLUT1CBP eIF3-p44 Flamingo PP2Czeta PLCgamma TFIIIC90 Wingless Frizzled Profilin TXBP151

      Prints:

      Doc4 protein , stress-induced gonadotrophin alpha 2 subunit

      I invented a sentence because the sample sentence didn't seem to include any of the sample proteins so didn't provide a very interesting test case!

      Very likely you will have to normalize the protein names in some fashion and normalize the sentence likewise so that variations in punctuation and white space usage don't prevent valid matches. But you'd have had to do that in any case so I guess you have that sorted out.


      True laziness is hard work
        Hi, can you please explain the following code means? I don't understand the sentence "$parent = $parent->{$part} ||= {};" as in:
        while (@parts) { my $part = shift @parts; $parent = $parent->{$part} ||= {}; next if @parts; $parent->{_name_} = $protein; }
        The other part of the code that I don't understand is "@best = ($parent->{_name_}, $wIndex)" as in
        while ($wIndex < @words && exists $parent->{$words[$wIndex]}) { @best = ($parent->{_name_}, $wIndex) if exists $parent->{_name +_}; $parent = $parent->{$words[$wIndex++]}; }
        Thanks for the explaination.
        Thank you so much for the suggestion. The approach works very fast to process each sentence.
        Thanks for the helpful technique. I learn something new. I have a question related to this technique. How can I do case insensitive comparison to tag protein name? If my sentence contains the name in the protein name list, but with different case, then I cannot find the matching.
        my $sentence = 'The DOC4 protein , stress-induced flibbled the woozle +in the presence of gonadotrophin alpha 2 subunit'; __DATA__ gonadotrophin gonadotrophin alpha 2 subunit Doc4 protein , stress-induced
        Desired output:
        The **DOC4 protein , stress-induced** flibbled the woozle in the prese +nce of **gonadotrophin alpha 2 subunit**
        It should match the longest protein name first, once the name is matched, we can ignore the matched protein name, and move on to next word. Thanks for any suggestion!