Re: Tag protein names in sentences

Re^2: Tag protein names in sentences by GrandFather (Saint) on Feb 12, 2010 at 22:44 UTC
I'd build a lookup table, then walk over the sentence one word at a time looking for matches:

    use strict;
    use warnings;

    my $sentence = 'The Doc4 protein , stress-induced flibbled the woozle in the presence of gonadotrophin alpha 2 subunit';

    # Build the lookup table as nested hashes, one level per word of a name.
    my %proteinLU;

    while (<DATA>) {
        chomp;
        my $protein = $_;
        my @parts = split;
        my $parent = \%proteinLU;

        while (@parts) {
            my $part = shift @parts;

            $parent = $parent->{$part} ||= {};
            next if @parts;
            # Last word of the name: record the complete name here.
            $parent->{_name_} = $protein;
        }
    }

    my @words = split ' ', $sentence;

    while (@words) {
        my $word = shift @words;

        next if ! exists $proteinLU{$word};

        my $parent = $proteinLU{$word};
        my $wIndex = 0;

        # Walk down the nested hashes while the following words keep matching.
        while ($wIndex < @words && exists $parent->{$words[$wIndex]}) {
            $parent = $parent->{$words[$wIndex++]}
        }

        print "$parent->{_name_}\n" if exists $parent->{_name_};
    }

    __DATA__
    7-phospho-2-dehydro-3-deoxy-D-arabino-heptonate D-erythrose-4-phosphate lyase (pyruvate-phosphorylating)
    gamma-glutamyl-gamma-aminobutyraldehyde dehydrogenase (Gamma-Glu-gamma-aminobutyraldehyde dehydrogenase)
    3-phosphoshikimate 1-carboxyvinyltransferase (5-enolpyruvylshikimate-3-phosphate synthase)(EPSP synthase
    Hypothetical protein CBG17340
    gonadotrophin alpha 2 subunit
    Doc4 protein , stress-induced
    optomotor-blind
    Dfrizzled-3
    Tramtrack69
    gutfeeling
    betaFTZ-F1
    Sex-lethal
    Strabismus
    PAR3alpha
    Armadillo
    AP-2alpha
    GLUT1CBP
    eIF3-p44
    Flamingo
    PP2Czeta
    PLCgamma
    TFIIIC90
    Wingless
    Frizzled
    Profilin
    TXBP151

Prints:

    Doc4 protein , stress-induced
    gonadotrophin alpha 2 subunit

I invented a sentence because the sample sentence didn't seem to include any of the sample proteins, so it didn't provide a very interesting test case!

Very likely you will have to normalize the protein names in some fashion, and normalize the sentence likewise, so that variations in punctuation and white space usage don't prevent valid matches. But you'd have had to do that in any case, so I guess you have that sorted out.

True laziness is hard work
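The normalization mentioned above could look something like the following. This is only a minimal sketch: the `normalize` helper is hypothetical (it does not appear in the thread), and the choice to lower-case everything and to treat commas, semicolons, parentheses and brackets as separate tokens is an assumption that would need checking against the real protein list.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): reduce a protein name or a
# sentence to a list of comparable tokens.
sub normalize {
    my ($text) = @_;

    $text = lc $text;                  # ignore case differences
    $text =~ s/([,;()\[\]])/ $1 /g;    # assumed rule: punctuation becomes its own token

    # split ' ' skips leading white space and collapses runs of white space.
    return split ' ', $text;
}

my @fromList     = normalize('Doc4 protein, stress-induced');
my @fromSentence = normalize('  DOC4   protein , stress-induced ');

print "@fromList\n";
print "same tokens\n" if "@fromList" eq "@fromSentence";
```

The same function would have to be applied both when building the lookup table and when splitting the sentence, otherwise the keys would not line up.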
Re^3: Tag protein names in sentences by sinlam (Novice) on Feb 16, 2010 at 17:13 UTC
Thank you so much for the suggestion. The approach processes each sentence very quickly.
Re^3: Tag protein names in sentences by sinlam (Novice) on Feb 17, 2010 at 23:11 UTC
Hi, can you please explain what the following code means? I don't understand the statement `$parent = $parent->{$part} ||= {};` in:

    while (@parts) {
        my $part = shift @parts;

        $parent = $parent->{$part} ||= {};
        next if @parts;
        $parent->{_name_} = $protein;
    }

The other part of the code that I don't understand is `@best = ($parent->{_name_}, $wIndex)` in:

    while ($wIndex < @words && exists $parent->{$words[$wIndex]}) {
        @best = ($parent->{_name_}, $wIndex) if exists $parent->{_name_};
        $parent = $parent->{$words[$wIndex++]};
    }

Thanks for the explanation.
Re^4: Tag protein names in sentences by GrandFather (Saint) on Feb 18, 2010 at 02:58 UTC
@parts is a list of the 'words' in a protein name to be matched. The code builds %proteinLU as chains of nested keys. The value for each key is another hash, except for _name_ keys, whose value is a complete protein name.

`$parent = $parent->{$part} ||= {};` sets the value of a new key to an empty hash, then makes that hash the next 'parent'. Using ||= in that way avoids an explicit `if ! exists $parent->{$part}` test.

The match code works by 'walking' down a chain of nested hash keys. Each time a new key is matched, its value becomes the next 'parent'. The assignment to @best 'remembers' the last protein name that matched. @best is an array because two values need to be remembered for the match: the protein name (`$parent->{_name_}`) and the number of words to remove (`$wIndex`).

True laziness is hard work
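To make the two ideas above concrete, here is a small self-contained sketch. It is a variant written for illustration, not the exact code under discussion: the sub names `build_lookup` and `longest_match` are hypothetical, and `@best` is updated after stepping into a node rather than before. The point is to show `||=` creating child hashes on demand and `@best` falling back to a shorter name when a longer chain stops matching part way.

```perl
use strict;
use warnings;

# Build the nested lookup hash from a list of protein names.
sub build_lookup {
    my (@names) = @_;
    my %lookup;

    for my $protein (@names) {
        my $parent = \%lookup;

        for my $part (split ' ', $protein) {
            # Create the child hash the first time this word is seen at
            # this level, then step down into it.
            $parent = $parent->{$part} ||= {};
        }

        $parent->{_name_} = $protein;    # marks the end of a complete name
    }

    return \%lookup;
}

# Return (name, word count) for the longest protein name that starts at
# $words->[$start], or an empty list when no name starts there.
sub longest_match {
    my ($lookup, $words, $start) = @_;
    my $parent = $lookup;
    my $index  = $start;
    my @best;

    while ($index < @$words && exists $parent->{$words->[$index]}) {
        $parent = $parent->{$words->[$index++]};
        # Remember the most recent complete name and how many words it used.
        @best = ($parent->{_name_}, $index - $start) if exists $parent->{_name_};
    }

    return @best;
}

my $lookup = build_lookup('gonadotrophin', 'gonadotrophin alpha 2 subunit');
my @words  = split ' ', 'gonadotrophin alpha 2 receptor';
my ($name, $count) = longest_match($lookup, \@words, 0);

# The longer name fails at 'receptor', so the shorter one is reported.
print "$name ($count word)\n";    # prints: gonadotrophin (1 word)
```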
Re^5: Tag protein names in sentences by sinlam (Novice) on Feb 18, 2010 at 19:34 UTC
Re^6: Tag protein names in sentences by GrandFather (Saint) on Feb 18, 2010 at 19:53 UTC
Re^3: Tag protein names in sentences by Anonymous Monk on Feb 15, 2010 at 21:19 UTC
Thanks for the helpful technique. I learned something new.

I have a question related to this technique: how can I do a case-insensitive comparison when tagging protein names? If my sentence contains a name from the protein list, but in a different case, then no match is found.

    my $sentence = 'The DOC4 protein , stress-induced flibbled the woozle in the presence of gonadotrophin alpha 2 subunit';

    __DATA__
    gonadotrophin
    gonadotrophin alpha 2 subunit
    Doc4 protein , stress-induced

Desired output:

    The DOC4 protein , stress-induced flibbled the woozle in the presence of gonadotrophin alpha 2 subunit

It should match the longest protein name first. Once a name is matched, we can skip over the matched protein name and move on to the next word.

Thanks for any suggestion!
Re^4: Tag protein names in sentences by GrandFather (Saint) on Feb 15, 2010 at 21:39 UTC
Use lc to lower case the words used as keys for the lookup hash, and lower case the search text. The code already finds long protein names which have a prefix that is the same as a shorter protein's name. I've added a splice to avoid finding a short protein that matches a part of the long protein name that has just been found. Oh, and I updated the print to generate the desired output.

    use strict;
    use warnings;

    my $sentence = 'a long mixed case name protein is found in preference to a mixed case name protein which is found before a short protein';
    my %proteinLU;

    while (<DATA>) {
        chomp;
        my $protein = $_;
        my @parts = split;
        my $parent = \%proteinLU;

        while (@parts) {
            # Keys are lower case; _name_ keeps the original spelling.
            my $part = lc shift @parts;

            $parent = $parent->{$part} ||= {};
            next if @parts;
            $parent->{_name_} = $protein;
        }
    }

    my @words = map {lc} split ' ', $sentence;

    while (@words) {
        my $word = shift @words;

        if (! exists $proteinLU{$word}) {
            print "$word ";
            next;
        }

        my $parent = $proteinLU{$word};
        my $wIndex = 0;

        while ($wIndex < @words && exists $parent->{$words[$wIndex]}) {
            $parent = $parent->{$words[$wIndex++]}
        }

        print "$parent->{_name_} " if exists $parent->{_name_};
        # Skip the words consumed by the match just found.
        splice @words, 0, $wIndex;
    }

    __DATA__
    long Mixed Case name protein
    Mixed Case name protein
    Protein

Prints:

    a long Mixed Case name protein is found in preference to a Mixed Case name protein which is found before a short Protein

True laziness is hard work
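One thing to be aware of with the loop above is that unmatched words are printed from the lower-cased copy of the sentence, and a word that starts a chain which never reaches a complete name is not printed at all. The sketch below is one possible variation, not the code from this reply: it lower-cases only for the comparison, copies unmatched words through unchanged, and uses the `@best` idea to fall back to the longest complete name. The sub names `build_lookup` and `rewrite_sentence` are hypothetical.

```perl
use strict;
use warnings;

# Build the lookup with lower-cased keys; _name_ keeps the listed spelling.
sub build_lookup {
    my (@names) = @_;
    my %lookup;

    for my $protein (@names) {
        my $parent = \%lookup;

        for my $part (split ' ', lc $protein) {
            $parent = $parent->{$part} ||= {};
        }

        $parent->{_name_} = $protein;
    }

    return \%lookup;
}

# Replace each longest match with the listed spelling of the protein name
# and copy every other word through exactly as it appeared.
sub rewrite_sentence {
    my ($lookup, $sentence) = @_;
    my @words = split ' ', $sentence;
    my @out;
    my $pos = 0;

    while ($pos < @words) {
        my $parent = $lookup;
        my $index  = $pos;
        my @best;

        while ($index < @words) {
            my $key = lc $words[$index];    # compare in lower case only

            last if ! exists $parent->{$key};
            $parent = $parent->{$key};
            ++$index;
            @best = ($parent->{_name_}, $index - $pos) if exists $parent->{_name_};
        }

        if (@best) {
            push @out, $best[0];
            $pos += $best[1];               # skip the words the name used
        }
        else {
            push @out, $words[$pos++];      # no complete name starts here
        }
    }

    return join ' ', @out;
}

my $lookup = build_lookup(
    'gonadotrophin',
    'gonadotrophin alpha 2 subunit',
    'Doc4 protein , stress-induced',
);

print rewrite_sentence($lookup,
    'The DOC4 protein , stress-induced flibbled the woozle in the presence of Gonadotrophin alpha 2 subunit'),
    "\n";
```

With the three names above, this prints the sentence with "The" left capitalized and the two protein names in their listed spelling. Wrapping a match in tags would only need a change to the `push @out, $best[0]` line.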
Re^5: Tag protein names in sentences by sinlam (Novice) on Feb 16, 2010 at 22:38 UTC
Re^6: Tag protein names in sentences by GrandFather (Saint) on Feb 16, 2010 at 22:56 UTC