Tag protein names in sentences

sinlam has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Tag protein names in sentences by BrowserUk (Patriarch) on Feb 12, 2010 at 19:41 UTC
Are you using perl v5.10.0 or later? I suspect that you may be falling foul of a pathological case of the regex Trie optimisation. See 5.10.0 regex slowdown for some discussion. You might find your performance improves if you include: `BEGIN{ ${^RE_TRIE_MAXBUF} = 0; }` [download] at the top of your program. Note: That's just one possibility that I draw from the referenced thread. Things may have move on since then. Another possibility would be to break your 7 million alternations regex up into smaller ones. For example, you could group them into subsets keyed by the first "word", and build an alternation regex from each group. You'd then interate over the hash of re's and only invoke the full sub-re on the sentence if it contains the first word. Something like: `@proteinNames = ...; my %res; m[^(\S+)] and push @{ $res{ $1 } }, $_ for @proteinNames; $res{ $_ } = join '\|', reverse sort map { quotemeta } @{ $res{ $_ } } for keys %res; while( my $line = <> ) { for my $key ( keys %res ) { if( $line =~ $key ) { $line =~ s[($re{ $key })][\\$1\\]ig; print $line; } } }` [download] Whether that would help or hinder is impossible to say without testing it on real data. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "I'd rather go naked than blow up my ass"	[reply] [d/l] [select]
Re: Tag protein names in sentences by BioLion (Curate) on Feb 12, 2010 at 19:04 UTC
Searching over and over is going to take forever! You would do better putting your time into computing a lookup table of the names, then breaking your sentences down into tokens (words) and checking for their existence in the lookup table. You could use either a hash, if you have the memory, or a database. Have a look around, maybe time spent now will save you time in the long run, but if you are only doing this search the once, maybe just take the hit of long computation time, and get onto the next problem. Sorry I can't be of more help, but I am going home! Just a something something...	[reply]
Re: Tag protein names in sentences by zwon (Abbot) on Feb 12, 2010 at 18:57 UTC
It would help if you show us sample data. Could you post example of `@proteinnam` and one (or more) of these lines on which it takes 2 seconds to match?	[reply] [d/l]
Re: Tag protein names in sentences by sinlam (Novice) on Feb 12, 2010 at 19:56 UTC
Sample protein names: 7-phospho-2-dehydro-3-deoxy-D-arabino-heptonate D-erythrose-4-phosphat +e lyase (pyruvate-phosphorylating) gamma-glutamyl-gamma-aminobutyraldehyde dehydrogenase (Gamma-Glu-gamma +-aminobutyraldehyde dehydrogenase) 3-phosphoshikimate 1-carboxyvinyltransferase (5-enolpyruvylshikimate-3 +-phosphate synthase)(EPSP synthase Hypothetical protein CBG17340 gonadotrophin alpha 2 subunit Doc4 protein , stress-induced optomotor-blind Dfrizzled-3 Tramtrack69 gutfeeling betaFTZ-F1 Sex-lethal Strabismus PAR3alpha Armadillo AP-2alpha GLUT1CBP eIF3-p44 Flamingo PP2Czeta PLCgamma TFIIIC90 Wingless Frizzled Profilin TXBP151 [download] `Sample sentences to tag on their protein names: The present p65(91) (arf)71 results show that the rate of Ca(2+)-induc +ed structural transition and Ca(2+) sensitivity of the inhibitory reg +ion of cTnI were modified by ( 1 ) thin filament formation , ( 2 ) th +e presence of strongly bound S1 , and ( 3 ) PKA phosphorylation of th +e N-terminus of cTnI` [download] The searching for the above sentence takes about 9 sec (not 2 sec as I mentioned earlier as I was using smaller number of protein names) when using 7 million protein names.	[reply] [d/l] [select]
Re^2: Tag protein names in sentences by GrandFather (Saint) on Feb 12, 2010 at 22:44 UTC
I'd build a lookup table then walk over the sentence one word at a time looking for matches: use strict; use warnings; my $sentence = 'The Doc4 protein , stress-induced flibbled the woozle +in the presence of gonadotrophin alpha 2 subunit'; my %proteinLU; while (<DATA>) { chomp; my $protein = $_; my @parts = split; my $parent = \%proteinLU; while (@parts) { my $part = shift @parts; $parent = $parent->{$part} \|\|= {}; next if @parts; $parent->{_name_} = $protein; } } my @words = split ' ', $sentence; while (@words) { my $word = shift @words; next if ! exists $proteinLU{$word}; my $parent = $proteinLU{$word}; my $wIndex = 0; while ($wIndex < @words && exists $parent->{$words[$wIndex]}) { $parent = $parent->{$words[$wIndex++]} } print "$parent->{_name_}\n" if exists $parent->{_name_}; } __DATA__ 7-phospho-2-dehydro-3-deoxy-D-arabino-heptonate D-erythrose-4-phosphat +e lyase (pyruvate-phosphorylating) gamma-glutamyl-gamma-aminobutyraldehyde dehydrogenase (Gamma-Glu-gamma +-aminobutyraldehyde dehydrogenase) 3-phosphoshikimate 1-carboxyvinyltransferase (5-enolpyruvylshikimate-3 +-phosphate synthase)(EPSP synthase Hypothetical protein CBG17340 gonadotrophin alpha 2 subunit Doc4 protein , stress-induced optomotor-blind Dfrizzled-3 Tramtrack69 gutfeeling betaFTZ-F1 Sex-lethal Strabismus PAR3alpha Armadillo AP-2alpha GLUT1CBP eIF3-p44 Flamingo PP2Czeta PLCgamma TFIIIC90 Wingless Frizzled Profilin TXBP151 [download] Prints: `Doc4 protein , stress-induced gonadotrophin alpha 2 subunit` [download] I invented a sentence because the sample sentence didn't seem to include any of the sample proteins so didn't provide a very interesting test case! Very likely you will have to normalize the protein names in some fashion and normalize the sentence likewise so that variations in punctuation and white space usage don't prevent valid matches. But you'd have had to do that in any case so I guess you have that sorted out. True laziness is hard work	[reply] [d/l] [select]
Re^3: Tag protein names in sentences by sinlam (Novice) on Feb 17, 2010 at 23:11 UTC
Hi, can you please explain the following code means? I don't understand the sentence "$parent = $parent->{$part} \|\|= {};" as in: `while (@parts) { my $part = shift @parts; $parent = $parent->{$part} \|\|= {}; next if @parts; $parent->{_name_} = $protein; }` [download] The other part of the code that I don't understand is "@best = ($parent->{_name_}, $wIndex)" as in `while ($wIndex < @words && exists $parent->{$words[$wIndex]}) { @best = ($parent->{_name_}, $wIndex) if exists $parent->{_name +_}; $parent = $parent->{$words[$wIndex++]}; }` [download] Thanks for the explaination.	[reply] [d/l] [select]
Re^4: Tag protein names in sentences by GrandFather (Saint) on Feb 18, 2010 at 02:58 UTC
Re^5: Tag protein names in sentences by sinlam (Novice) on Feb 18, 2010 at 19:34 UTC
Some notes below your chosen depth have not been shown here
Re^3: Tag protein names in sentences by sinlam (Novice) on Feb 16, 2010 at 17:13 UTC
Thank you so much for the suggestion. The approach works very fast to process each sentence.	[reply]
Re^3: Tag protein names in sentences by Anonymous Monk on Feb 15, 2010 at 21:19 UTC
Thanks for the helpful technique. I learn something new. I have a question related to this technique. How can I do case insensitive comparison to tag protein name? If my sentence contains the name in the protein name list, but with different case, then I cannot find the matching. `my $sentence = 'The DOC4 protein , stress-induced flibbled the woozle +in the presence of gonadotrophin alpha 2 subunit'; __DATA__ gonadotrophin gonadotrophin alpha 2 subunit Doc4 protein , stress-induced` [download] Desired output: `The DOC4 protein , stress-induced flibbled the woozle in the prese +nce of gonadotrophin alpha 2 subunit` [download] It should match the longest protein name first, once the name is matched, we can ignore the matched protein name, and move on to next word. Thanks for any suggestion!	[reply] [d/l] [select]
Re^4: Tag protein names in sentences by GrandFather (Saint) on Feb 15, 2010 at 21:39 UTC
Re^5: Tag protein names in sentences by sinlam (Novice) on Feb 16, 2010 at 22:38 UTC
Some notes below your chosen depth have not been shown here