in reply to Re: Tag protein names in sentences
in thread Tag protein names in sentences

I'd build a lookup table then walk over the sentence one word at a time looking for matches:

use strict;
use warnings;

my $sentence = 'The Doc4 protein , stress-induced flibbled the woozle in the presence of gonadotrophin alpha 2 subunit';
my %proteinLU;

while (<DATA>) {
    chomp;
    my $protein = $_;
    my @parts = split;
    my $parent = \%proteinLU;

    while (@parts) {
        my $part = shift @parts;

        $parent = $parent->{$part} ||= {};
        next if @parts;
        $parent->{_name_} = $protein;
    }
}

my @words = split ' ', $sentence;

while (@words) {
    my $word = shift @words;

    next if ! exists $proteinLU{$word};

    my $parent = $proteinLU{$word};
    my $wIndex = 0;

    while ($wIndex < @words && exists $parent->{$words[$wIndex]}) {
        $parent = $parent->{$words[$wIndex++]}
    }

    print "$parent->{_name_}\n" if exists $parent->{_name_};
}

__DATA__
7-phospho-2-dehydro-3-deoxy-D-arabino-heptonate D-erythrose-4-phosphate lyase (pyruvate-phosphorylating)
gamma-glutamyl-gamma-aminobutyraldehyde dehydrogenase (Gamma-Glu-gamma-aminobutyraldehyde dehydrogenase)
3-phosphoshikimate 1-carboxyvinyltransferase (5-enolpyruvylshikimate-3-phosphate synthase)(EPSP synthase
Hypothetical protein CBG17340
gonadotrophin alpha 2 subunit
Doc4 protein , stress-induced
optomotor-blind
Dfrizzled-3
Tramtrack69
gutfeeling
betaFTZ-F1
Sex-lethal
Strabismus
PAR3alpha
Armadillo
AP-2alpha
GLUT1CBP
eIF3-p44
Flamingo
PP2Czeta
PLCgamma
TFIIIC90
Wingless
Frizzled
Profilin
TXBP151

Prints:

Doc4 protein , stress-induced
gonadotrophin alpha 2 subunit

I invented a sentence because the sample sentence didn't seem to include any of the sample proteins so didn't provide a very interesting test case!

Very likely you will have to normalize the protein names in some fashion and normalize the sentence likewise so that variations in punctuation and white space usage don't prevent valid matches. But you'd have had to do that in any case so I guess you have that sorted out.
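
As a rough illustration of that normalization step, here is a minimal sketch. The helper name `normalize` and the three rules it applies (fold case, make each comma a separate word as in the sample data, collapse white space) are assumptions for the example, not something the original data requires.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the same clean-up to protein names and to
# sentences so both sides of the lookup agree on case, comma spacing and
# white space.
sub normalize {
    my ($text) = @_;

    $text = lc $text;            # fold case
    $text =~ s/\s*,\s*/ , /g;    # make every comma a separate "word"
    $text =~ s/\s+/ /g;          # collapse runs of white space
    $text =~ s/^ | $//g;         # trim leading and trailing space

    return $text;
}

# Both calls print: doc4 protein , stress-induced
print normalize('Doc4  protein, stress-induced'), "\n";
print normalize(' doc4 protein , Stress-Induced '), "\n";
```

Running every protein name through the helper before building the lookup table, and every sentence through it before splitting, keeps the two in step. The original spelling can still be stored as the `_name_` value if it is needed for output.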


True laziness is hard work

Re^3: Tag protein names in sentences
by sinlam (Novice) on Feb 16, 2010 at 17:13 UTC
    Thank you so much for the suggestion. The approach processes each sentence very quickly.
Re^3: Tag protein names in sentences
by sinlam (Novice) on Feb 17, 2010 at 23:11 UTC
    Hi, can you please explain what the following code means? I don't understand the statement "$parent = $parent->{$part} ||= {};" in:

    while (@parts) {
        my $part = shift @parts;
        $parent = $parent->{$part} ||= {};
        next if @parts;
        $parent->{_name_} = $protein;
    }

    The other part of the code that I don't understand is "@best = ($parent->{_name_}, $wIndex)" in:

    while ($wIndex < @words && exists $parent->{$words[$wIndex]}) {
        @best = ($parent->{_name_}, $wIndex) if exists $parent->{_name_};
        $parent = $parent->{$words[$wIndex++]};
    }

    Thanks for the explanation.

      @parts is a list of the 'words' in a protein name to be matched. The code builds %proteinLU as chains of nested keys. The value for each key is another hash except for _name_ keys whose value is a complete protein name.

      The statement $parent = $parent->{$part} ||= {}; sets the value of a new key to an empty hash, leaves the hash of an existing key alone, and then makes that hash the new $parent. Using ||= in that way avoids an explicit if ! exists $parent->{$part} test.
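
To make the idiom concrete, here is a small sketch that spells out the test that `||=` replaces. The helper `add_name` and the table name `%lookup` are made up for the example. The long-hand form is only roughly equivalent: `||=` tests truth rather than existence, which makes no difference when every value is a hash reference.

```perl
use strict;
use warnings;

my %lookup;

# Hypothetical helper showing the ||= idiom on a two-name lookup table.
sub add_name {
    my ($name) = @_;
    my $parent = \%lookup;

    for my $part (split ' ', $name) {
        # Long-hand version of: $parent = $parent->{$part} ||= {};
        $parent->{$part} = {} if ! exists $parent->{$part};
        $parent = $parent->{$part};
    }

    $parent->{_name_} = $name;
}

add_name('gonadotrophin');
add_name('gonadotrophin alpha 2 subunit');

# Both names share the top-level 'gonadotrophin' node.
# Prints: _name_, alpha
print join(', ', sort keys %{$lookup{gonadotrophin}}), "\n";
```

The second call reuses the node created by the first, so the shorter name and the longer name that starts with it sit in the same chain.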

      The match code works by 'walking' down a chain of nested hash keys. Each time a new key is matched its value becomes the next 'parent'. The assignment to @best 'remembers' the last protein name that matched. @best is an array because two values need to be remembered for the match: the protein name ($parent->{_name_}) and the number of words to remove ($wIndex).
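
The walk described above can be sketched as a small function. This is an illustration only, not the code from the post: the helper name `longest_match`, the hand-built `%lookup` table and the loop layout are assumptions.

```perl
use strict;
use warnings;

# Hand-built lookup table holding 'gonadotrophin' and
# 'gonadotrophin alpha 2 subunit'.
my %lookup = (
    gonadotrophin => {
        _name_ => 'gonadotrophin',
        alpha  => {2 => {subunit => {_name_ => 'gonadotrophin alpha 2 subunit'}}},
    },
);

# Hypothetical helper: return (name, number of further words to remove) for
# the longest protein name starting at the first word, or an empty list.
sub longest_match {
    my ($first, @rest) = @_;
    return if !defined($first) || !exists $lookup{$first};

    my $parent = $lookup{$first};
    my $wIndex = 0;
    my @best;

    while (1) {
        # Remember the most recent node that completes a protein name.
        @best = ($parent->{_name_}, $wIndex) if exists $parent->{_name_};
        last if $wIndex >= @rest || !exists $parent->{$rest[$wIndex]};
        $parent = $parent->{$rest[$wIndex++]};
    }

    return @best;
}

# The walk gets as far as 'alpha' and then fails on '1', so the remembered
# shorter name wins. Prints: gonadotrophin matched, 0 further words to remove
my ($name, $extra) = longest_match(qw(gonadotrophin alpha 1 chain));
print "$name matched, $extra further words to remove\n";
```

Without the remembered pair, a walk that runs part of the way down a longer name and then fails would report nothing, even though a shorter name had already matched.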


        Hi, thanks for the explanation; I understand the code better now. Sorry, I have another question, as I want to use this data structure in future. Is it possible to use a loop to traverse the whole %proteinLU hash of hashes? I spent some time trying, but could not traverse it successfully. Thanks for your help!
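
One way to approach this question is recursion rather than a single flat loop, because the chains in the table have different depths. The sketch below is only a suggestion: the helper name `collect_names` is made up, and the small table stands in for a real %proteinLU.

```perl
use strict;
use warnings;

# Small stand-in for %proteinLU, built by hand for the example.
my %proteinLU = (
    gonadotrophin => {
        _name_ => 'gonadotrophin',
        alpha  => {2 => {subunit => {_name_ => 'gonadotrophin alpha 2 subunit'}}},
    },
    doc4 => {protein => {_name_ => 'Doc4 protein'}},
);

# Hypothetical helper: visit every node and collect the stored names.
sub collect_names {
    my ($node) = @_;
    my @names;

    for my $key (sort keys %$node) {
        if ($key eq '_name_') {
            push @names, $node->{$key};                   # a complete name
        } else {
            push @names, collect_names($node->{$key});    # descend one level
        }
    }

    return @names;
}

# Prints, one per line:
#   Doc4 protein
#   gonadotrophin
#   gonadotrophin alpha 2 subunit
print "$_\n" for collect_names(\%proteinLU);
```

Each call handles one level of the table and hands every child hash to another call, so the depth of a chain never has to be known in advance.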
Re^3: Tag protein names in sentences
by Anonymous Monk on Feb 15, 2010 at 21:19 UTC
    Thanks for the helpful technique. I learned something new. I have a related question: how can I do a case-insensitive comparison when tagging protein names? If my sentence contains a name from the protein list but in a different case, no match is found.

    my $sentence = 'The DOC4 protein , stress-induced flibbled the woozle in the presence of gonadotrophin alpha 2 subunit';

    __DATA__
    gonadotrophin
    gonadotrophin alpha 2 subunit
    Doc4 protein , stress-induced

    Desired output:

    The **DOC4 protein , stress-induced** flibbled the woozle in the presence of **gonadotrophin alpha 2 subunit**

    It should match the longest protein name first. Once a name is matched, we can skip over the matched words and move on to the next word. Thanks for any suggestions!

      Use lc to lower case the words used as keys for the lookup hash and lower case the search text. The code already finds long protein names which have a prefix that is the same as a shorter protein's name. I've added a splice to avoid finding a short protein that matches a part of the long protein name that has just been found. Oh, and I updated the print to generate the desired output.

      use strict;
      use warnings;

      my $sentence = 'a long mixed case name protein is found in preference to a mixed case name protein which is found before a short protein';
      my %proteinLU;

      while (<DATA>) {
          chomp;
          my $protein = $_;
          my @parts = split;
          my $parent = \%proteinLU;

          while (@parts) {
              my $part = lc shift @parts;

              $parent = $parent->{$part} ||= {};
              next if @parts;
              $parent->{_name_} = $protein;
          }
      }

      my @words = map {lc} split ' ', $sentence;

      while (@words) {
          my $word = shift @words;

          if (! exists $proteinLU{$word}) {
              print "$word ";
              next;
          }

          my $parent = $proteinLU{$word};
          my $wIndex = 0;

          while ($wIndex < @words && exists $parent->{$words[$wIndex]}) {
              $parent = $parent->{$words[$wIndex++]}
          }

          print "**$parent->{_name_}** " if exists $parent->{_name_};
          splice @words, 0, $wIndex;
      }

      __DATA__
      long Mixed Case name protein
      Mixed Case name protein
      Protein

      Prints:

      a **long Mixed Case name protein** is found in preference to a **Mixed Case name protein** which is found before a short **Protein**

        Hi, I noticed a problem: a short protein name is not tagged and some words of the sentence are missing from the output, but I don't know what is wrong in the script. I extracted the protein names that contain "MCM2" and the test sentence that shows the problem. Is the problem related to redundancy among protein names that differ only in case? Thanks for any information!
        use strict;
        use warnings;

        my $sentence = 'Protein kinase A-anchoring protein AKAP95 interacts with MCM2 , a regulator of DNA replication';
        my %proteinLU;

        while (<DATA>) {
            chomp;
            my $protein = $_;
            my @parts = split;
            my $parent = \%proteinLU;

            while (@parts) {
                my $part = lc shift @parts;

                $parent = $parent->{$part} ||= {};
                next if @parts;
                $parent->{_name_} = $protein;
            }
        }

        my @words = map {lc} split ' ', $sentence;

        while (@words) {
            my $word = shift @words;

            if (! exists $proteinLU{$word}) {
                print "$word ";
                next;
            }

            my $parent = $proteinLU{$word};
            my $wIndex = 0;

            while ($wIndex < @words && exists $parent->{$words[$wIndex]}) {
                $parent = $parent->{$words[$wIndex++]}
            }

            print "**$parent->{_name_}** " if exists $parent->{_name_};
            splice @words, 0, $wIndex;
        }

        __DATA__
        replication licensing factor MCM2/3/5-type protein
        replication licensing factor MCM2 ALT_NAMES:MCM2p
        replication licensing factor MCM2 ALT_NAMES:CDCL1
        minichromosome maintenance protein MCM2 homolog
        minichromosome maintenance protein MCM2
        DNA replication licensing factor Mcm2
        DNA replication licensing factor mcm2
        DNA REPLICATION LICENSING FACTOR MCM2
        mcm2 DNA replication licensing factor
        Mcm2 DNA replication licensing factor
        DNA replication licensing factor MCM2
        contains MCM2/3/5 family signature
        Central kinetochore subunit MCM22
        replication licensing factor MCM2
        Central kinetochore subunit MCM21
        Similarity to yeast mcm2 protein
        similarity to yeast mcm2 protein
        MCM2/3/5 family , most YBR1441
        novel MCM2/3/5 family member
        NOVEL MCM2/3/5 FAMILY MEMBER
        budding yeast MCM2 homolog
        member of MCM2/3/5 family
        MCM complex subunit Mcm2
        MCM2/3/5 family protein
        MCM2-related protein
        At5g18020/MCM23_11
        At5g18030/MCM23_13
        mcm2-prov protein
        MCM2 , FORMERLY
        MCM2/3/5 family
        P1 clone:MCM23
        mcm2 protein
        Mcm2 protein
        MCM2 protein
        mcm2-prov
        MCM23_15
        MCM23_13
        MCM23.13
        MCM23.11
        MCM23.15
        MCM23.16
        MCM23_14
        MCM23_16
        MCM23.14
        MCM23_11
        MCM23_7
        Mcm2-PA
        MCM23.9
        MCM23.6
        MCM23_4
        Mcm2-RA
        MCM23.7
        Mcm2-XP
        Mcm2-XR
        MCM23.4
        MCM23_6
        MCM23_1
        MCM23_3
        MCM23.3
        MCM23.1
        MCM23_2
        MCM23.5
        MCM23_9
        MCM23_5
        MCM23.2
        mcm22p
        mcm2_1
        Mcm22p
        mcm2_2
        Mcm21p
        mcm21p
        MCM2_1
        huMCM2
        DmMcm2
        MCM2_2
        DmMCM2
        X.MCM2
        hMCM2
        dMCM2
        mcm22
        MCM21
        mMCM2
        xMCM2
        MCM2p
        MCM22
        mcm2p
        mMcm2
        mcm21
        MMCM2
        MCM2
        mcm2
        Mcm2
        Desired output will be:
        Protein kinase A-anchoring protein AKAP95 interacts with **MCM2** , a regulator of DNA replication.
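
Reading the script as posted, one likely cause is that the walk follows "MCM2 ," into a longer entry such as "MCM2 , FORMERLY" and stops on a node that has no _name_. Nothing is printed, and the splice then discards the words the walk consumed. The same thing would happen to "DNA replication" at the end of the sentence. Names that differ only in case do not drop words. They share one chain, and the last one read overwrites _name_, so they only affect which spelling is printed.

The sketch below is one possible way around both issues, not a tested fix for the full name list. It remembers how many words the longest complete name covered and prints the words as they appear in the sentence. The helper name `tag_sentence` and the shortened name list are assumptions for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: tag the longest protein name at each position and
# keep every word exactly as it appears in the sentence.
sub tag_sentence {
    my ($sentence, @names) = @_;
    my %lookup;

    # Build the lookup table with lower-cased keys.
    for my $name (@names) {
        my $parent = \%lookup;
        $parent = $parent->{lc $_} ||= {} for split ' ', $name;
        $parent->{_name_} = $name;
    }

    my @words = split ' ', $sentence;
    my @out;

    while (@words) {
        my $parent = \%lookup;
        my $best = 0;    # words covered by the longest complete name so far

        for my $used (1 .. @words) {
            $parent = $parent->{lc $words[$used - 1]} or last;
            $best = $used if exists $parent->{_name_};
        }

        if ($best) {
            push @out, '**' . join(' ', splice @words, 0, $best) . '**';
        } else {
            push @out, shift @words;    # no name starts here: keep the word
        }
    }

    return join ' ', @out;
}

# Prints: interacts with **MCM2** , a regulator of DNA replication
print tag_sentence(
    'interacts with MCM2 , a regulator of DNA replication',
    'MCM2 , FORMERLY', 'DNA replication licensing factor MCM2', 'MCM2',
), "\n";
```

Because only the count of matched words is kept, a partial walk that fails costs nothing: the words it looked at stay in the list and are examined again from the next position.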