in reply to Tag protein names in sentences

Sample protein names:

7-phospho-2-dehydro-3-deoxy-D-arabino-heptonate D-erythrose-4-phosphate lyase (pyruvate-phosphorylating)
gamma-glutamyl-gamma-aminobutyraldehyde dehydrogenase (Gamma-Glu-gamma-aminobutyraldehyde dehydrogenase)
3-phosphoshikimate 1-carboxyvinyltransferase (5-enolpyruvylshikimate-3-phosphate synthase)(EPSP synthase
Hypothetical protein CBG17340
gonadotrophin alpha 2 subunit
Doc4 protein , stress-induced
optomotor-blind
Dfrizzled-3
Tramtrack69
gutfeeling
betaFTZ-F1
Sex-lethal
Strabismus
PAR3alpha
Armadillo
AP-2alpha
GLUT1CBP
eIF3-p44
Flamingo
PP2Czeta
PLCgamma
TFIIIC90
Wingless
Frizzled
Profilin
TXBP151

Sample sentence to tag with protein names:

The present p65(91) (arf)71 results show that the rate of Ca(2+)-induced structural transition and Ca(2+) sensitivity of the inhibitory region of cTnI were modified by ( 1 ) thin filament formation , ( 2 ) the presence of strongly bound S1 , and ( 3 ) PKA phosphorylation of the N-terminus of cTnI

Searching the above sentence takes about 9 seconds when using 7 million protein names (not the 2 seconds I mentioned earlier, when I was using a smaller number of protein names).

Re^2: Tag protein names in sentences
by GrandFather (Saint) on Feb 12, 2010 at 22:44 UTC

    I'd build a lookup table then walk over the sentence one word at a time looking for matches:

        use strict;
        use warnings;

        my $sentence = 'The Doc4 protein , stress-induced flibbled the woozle in the presence of gonadotrophin alpha 2 subunit';
        my %proteinLU;

        # Build the lookup: one nested hash level per word of each protein name.
        while (<DATA>) {
            chomp;
            my $protein = $_;
            my @parts   = split;
            my $parent  = \%proteinLU;

            while (@parts) {
                my $part = shift @parts;

                $parent = $parent->{$part} ||= {};
                next if @parts;
                $parent->{_name_} = $protein;    # last word: record the full name
            }
        }

        # Walk the sentence one word at a time, following the nested keys.
        my @words = split ' ', $sentence;

        while (@words) {
            my $word = shift @words;

            next if !exists $proteinLU{$word};

            my $parent = $proteinLU{$word};
            my $wIndex = 0;

            while ($wIndex < @words && exists $parent->{$words[$wIndex]}) {
                $parent = $parent->{$words[$wIndex++]};
            }

            print "$parent->{_name_}\n" if exists $parent->{_name_};
        }

        __DATA__
        7-phospho-2-dehydro-3-deoxy-D-arabino-heptonate D-erythrose-4-phosphate lyase (pyruvate-phosphorylating)
        gamma-glutamyl-gamma-aminobutyraldehyde dehydrogenase (Gamma-Glu-gamma-aminobutyraldehyde dehydrogenase)
        3-phosphoshikimate 1-carboxyvinyltransferase (5-enolpyruvylshikimate-3-phosphate synthase)(EPSP synthase
        Hypothetical protein CBG17340
        gonadotrophin alpha 2 subunit
        Doc4 protein , stress-induced
        optomotor-blind
        Dfrizzled-3
        Tramtrack69
        gutfeeling
        betaFTZ-F1
        Sex-lethal
        Strabismus
        PAR3alpha
        Armadillo
        AP-2alpha
        GLUT1CBP
        eIF3-p44
        Flamingo
        PP2Czeta
        PLCgamma
        TFIIIC90
        Wingless
        Frizzled
        Profilin
        TXBP151

    Prints:

        Doc4 protein , stress-induced
        gonadotrophin alpha 2 subunit

    I invented a sentence because the sample sentence didn't seem to include any of the sample proteins so didn't provide a very interesting test case!

    Very likely you will have to normalize the protein names in some fashion and normalize the sentence likewise so that variations in punctuation and white space usage don't prevent valid matches. But you'd have had to do that in any case so I guess you have that sorted out.
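
One way the normalization step mentioned above might look, as a minimal sketch. The sub name `normalize` and the choice of which punctuation to set off as separate words are assumptions for illustration; the thread does not say how the real data is normalized. The same routine would be applied to every protein name before building the lookup and to every sentence before searching.

```perl
use strict;
use warnings;

# Hypothetical helper: put text into one canonical spacing so that names
# and sentences split into the same 'words'.
sub normalize {
    my ($text) = @_;

    $text =~ s/([,;:])/ $1 /g;    # assumed rule: these marks become separate words
    $text =~ s/^\s+|\s+$//g;      # trim both ends
    $text =~ s/\s+/ /g;           # collapse runs of white space
    return $text;
}

print normalize("Doc4  protein, stress-induced"), "\n";
# prints: Doc4 protein , stress-induced
```

Text that is already in the canonical form passes through unchanged, so it is safe to run the routine on data that has been partly cleaned.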


    True laziness is hard work
      Thank you so much for the suggestion. The approach processes each sentence very quickly.
      Hi, can you please explain what the following code means? I don't understand the statement "$parent = $parent->{$part} ||= {};" as in:

          while (@parts) {
              my $part = shift @parts;
              $parent = $parent->{$part} ||= {};
              next if @parts;
              $parent->{_name_} = $protein;
          }

      The other part of the code that I don't understand is "@best = ($parent->{_name_}, $wIndex)" as in:

          while ($wIndex < @words && exists $parent->{$words[$wIndex]}) {
              @best = ($parent->{_name_}, $wIndex) if exists $parent->{_name_};
              $parent = $parent->{$words[$wIndex++]};
          }

      Thanks for the explanation.

        @parts is a list of the 'words' in a protein name to be matched. The code builds %proteinLU as chains of nested keys. The value for each key is another hash except for _name_ keys whose value is a complete protein name.
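
To make the nested structure concrete, here is a small sketch that builds the lookup for two names sharing a first word and dumps it. The sub name `build_lookup` is an assumption for illustration; the loop body follows the code in the thread.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper wrapping the lookup-building loop from the thread.
sub build_lookup {
    my %lookup;
    for my $protein (@_) {
        my $parent = \%lookup;
        # Descend one level per word, creating levels as needed.
        $parent = $parent->{$_} ||= {} for split ' ', $protein;
        $parent->{_name_} = $protein;    # mark the end of a complete name
    }
    return \%lookup;
}

my $lookup = build_lookup('gonadotrophin', 'gonadotrophin alpha 2 subunit');

# Shape of the result:
#   gonadotrophin => {
#       _name_ => 'gonadotrophin',
#       alpha  => { 2 => { subunit => { _name_ => 'gonadotrophin alpha 2 subunit' } } },
#   }
$Data::Dumper::Sortkeys = 1;
print Dumper($lookup);
```

Note that the scheme relies on no protein name containing the literal word `_name_`, since that key is reserved for the end-of-name marker.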

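        $parent = $parent->{$part} ||= {}; creates an empty hash as the value of the $part key when that key has no value yet, then makes that hash (new or already existing) the next $parent. Using ||= in that way avoids an explicit if ! exists $parent->{$part} test.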
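
The following sketch puts the compact statement next to a longhand version to show they do the same thing here. The key `alpha` and the `seen_by` marker are made up for the demonstration. Strictly, `||=` tests for a true value rather than for existence, but every value stored at this point is a hash reference, which is always true, so the two forms agree.

```perl
use strict;
use warnings;

my %tree;

# Compact form, as used in the lookup-building loop.
my $parent = \%tree;
$parent = $parent->{alpha} ||= {};
$parent->{seen_by} = 'compact';    # leave a marker in the child hash

# Longhand equivalent: create the child only if it is missing, then descend.
my $again = \%tree;
if (!exists $again->{alpha}) {
    $again->{alpha} = {};
}
$again = $again->{alpha};

# Both walks end at the same child hash, and the second did not overwrite it.
print "same hash\n" if $parent == $again;
print "kept: $again->{seen_by}\n";
# prints:
#   same hash
#   kept: compact
```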

        The match code works by 'walking' down a chain of nested hash keys. Each time a new key is matched its value becomes the next 'parent'. The assignment to @best 'remembers' the last protein name that matched. @best is an array because two values need to be remembered for the match: the protein name ($parent->{_name_}) and the number of words to remove ($wIndex).
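
A self-contained sketch of the matching walk with @best, reconstructed from the fragment quoted in the question. The sub names `build_lookup` and `longest_matches` and the exact loop shape are assumptions, not the original posted code. It shows why remembering the last complete name matters: the walk can run past a short name while following a longer one that never completes.

```perl
use strict;
use warnings;

# Hypothetical helper: build the nested lookup from a list of names.
sub build_lookup {
    my %lookup;
    for my $protein (@_) {
        my $parent = \%lookup;
        $parent = $parent->{$_} ||= {} for split ' ', $protein;
        $parent->{_name_} = $protein;
    }
    return \%lookup;
}

# Hypothetical helper: return the longest protein name found at each position.
sub longest_matches {
    my ($lookup, $sentence) = @_;
    my @words = split ' ', $sentence;
    my @found;

    while (@words) {
        my $word = shift @words;
        next if !exists $lookup->{$word};

        my $parent = $lookup->{$word};
        my $wIndex = 0;
        my @best;

        while (1) {
            # Remember the name ending here and how many further words it used.
            @best = ($parent->{_name_}, $wIndex) if exists $parent->{_name_};
            last if $wIndex >= @words || !exists $parent->{$words[$wIndex]};
            $parent = $parent->{$words[$wIndex++]};
        }

        next if !@best;
        push @found, $best[0];
        splice @words, 0, $best[1];    # drop the words the match consumed
    }
    return @found;
}

my $lookup = build_lookup('gonadotrophin', 'gonadotrophin alpha 2 subunit');

# The walk follows 'alpha' but the long name never completes, so the
# remembered short name is reported.
print "$_\n" for longest_matches($lookup, 'gonadotrophin alpha 1 subunit');
# prints: gonadotrophin
```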


        True laziness is hard work
      Thanks for the helpful technique. I learned something new. I have a related question: how can I do a case-insensitive comparison when tagging protein names? If my sentence contains a name from the protein name list but in a different case, no match is found.

          my $sentence = 'The DOC4 protein , stress-induced flibbled the woozle in the presence of gonadotrophin alpha 2 subunit';

          __DATA__
          gonadotrophin
          gonadotrophin alpha 2 subunit
          Doc4 protein , stress-induced

      Desired output:

          The **DOC4 protein , stress-induced** flibbled the woozle in the presence of **gonadotrophin alpha 2 subunit**

      It should match the longest protein name first. Once a name is matched, we can skip the matched words and move on to the next word. Thanks for any suggestion!

        Use lc to lower case the words used as keys for the lookup hash and lower case the search text. The code already finds long protein names which have a prefix that is the same as a shorter protein's name. I've added a splice to avoid finding a short protein that matches a part of the long protein name that has just been found. Oh, and I updated the print to generate the desired output.

            use strict;
            use warnings;

            my $sentence = 'a long mixed case name protein is found in preference to a mixed case name protein which is found before a short protein';
            my %proteinLU;

            while (<DATA>) {
                chomp;
                my $protein = $_;
                my @parts   = split;
                my $parent  = \%proteinLU;

                while (@parts) {
                    my $part = lc shift @parts;    # lookup keys are lower case

                    $parent = $parent->{$part} ||= {};
                    next if @parts;
                    $parent->{_name_} = $protein;  # keep the name's original case
                }
            }

            my @words = map {lc} split ' ', $sentence;

            while (@words) {
                my $word = shift @words;

                if (!exists $proteinLU{$word}) {
                    print "$word ";
                    next;
                }

                my $parent = $proteinLU{$word};
                my $wIndex = 0;

                while ($wIndex < @words && exists $parent->{$words[$wIndex]}) {
                    $parent = $parent->{$words[$wIndex++]};
                }

                print "**$parent->{_name_}** " if exists $parent->{_name_};
                splice @words, 0, $wIndex;    # skip the words just matched
            }

            __DATA__
            long Mixed Case name protein
            Mixed Case name protein
            Protein

        Prints:

            a **long Mixed Case name protein** is found in preference to a **Mixed Case name protein** which is found before a short **Protein**
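
A variation, sketched under stated assumptions rather than taken from the thread: it lower-cases only the lookup keys and the words used for comparison, so the output keeps the sentence's own spelling as in the desired output asked for above. It also remembers the last complete name during the walk, and prints a word untagged when it only starts a name that never completes. The sub names `build_lookup_ci` and `tag_sentence` are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: nested lookup keyed on lower-cased words.
sub build_lookup_ci {
    my %lookup;
    for my $protein (@_) {
        my $parent = \%lookup;
        $parent = $parent->{lc $_} ||= {} for split ' ', $protein;
        $parent->{_name_} = $protein;
    }
    return \%lookup;
}

# Hypothetical helper: wrap each longest match in ** ... **, keeping the
# sentence's original case for every word.
sub tag_sentence {
    my ($lookup, $sentence) = @_;
    my @words = split ' ', $sentence;
    my @out;

    while (@words) {
        my $word   = shift @words;
        my $parent = $lookup->{lc $word};

        if (ref($parent) ne 'HASH') {    # word does not start any name
            push @out, $word;
            next;
        }

        my $wIndex = 0;
        my $best;    # further words used by the longest complete name so far

        while (1) {
            $best = $wIndex if exists $parent->{_name_};
            last if $wIndex >= @words;
            my $next = $parent->{lc $words[$wIndex]};
            last if ref($next) ne 'HASH';
            $parent = $next;
            $wIndex++;
        }

        if (defined $best) {
            my @matched = ($word, splice(@words, 0, $best));
            push @out, '**' . join(' ', @matched) . '**';
        }
        else {
            push @out, $word;    # only a partial prefix matched: keep the word
        }
    }
    return join ' ', @out;
}

my $lookup = build_lookup_ci(
    'gonadotrophin',
    'gonadotrophin alpha 2 subunit',
    'Doc4 protein , stress-induced',
);

print tag_sentence($lookup,
    'The DOC4 protein , stress-induced flibbled the woozle in the presence of gonadotrophin alpha 2 subunit'),
    "\n";
```

The `ref(...) ne 'HASH'` checks stop the walk cleanly if a sentence happens to contain the reserved word `_name_`, whose value in the lookup is a string rather than a nested hash.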
