in reply to Re^3: Tag protein names in sentences
in thread Tag protein names in sentences
Use lc to lower-case both the words used as keys in the lookup hash and the search text. The code already finds a long protein name whose leading words are the same as a shorter protein's name. I've added a splice so that a short protein name isn't matched again inside the long name that has just been found. Oh, and I updated the print to generate the desired output.
use strict;
use warnings;

my $sentence = 'a long mixed case name protein is found in preference to a mixed case name protein which is found before a short protein';

# Build a nested hash keyed on the lower-cased words of each protein name.
my %proteinLU;

while (<DATA>) {
    chomp;
    my $protein = $_;
    my @parts   = split;
    my $parent  = \%proteinLU;

    while (@parts) {
        my $part = lc shift @parts;
        $parent = $parent->{$part} ||= {};
        next if @parts;
        $parent->{_name_} = $protein;    # last word: keep the original spelling
    }
}

my @words = map {lc} split ' ', $sentence;

while (@words) {
    my $word = shift @words;

    if (!exists $proteinLU{$word}) {
        print "$word ";
        next;
    }

    # Follow the nested hash for as long as the next words keep matching.
    my $parent = $proteinLU{$word};
    my $wIndex = 0;

    while ($wIndex < @words && exists $parent->{$words[$wIndex]}) {
        $parent = $parent->{$words[$wIndex++]};
    }

    if (exists $parent->{_name_}) {
        print "**$parent->{_name_}** ";
        splice @words, 0, $wIndex;    # skip the words the name consumed
    }
    else {
        print "$word ";    # only part of a name matched: keep the word
    }
}

__DATA__
long Mixed Case name protein
Mixed Case name protein
Protein
Prints:
a **long Mixed Case name protein** is found in preference to a **Mixed Case name protein** which is found before a short **Protein**
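If you want the same lookup as a reusable sub, something along these lines might do. It is only a sketch: build_lookup and tag_sentence are names I made up for illustration, and unlike the script above it remembers the longest complete name seen during the walk and leaves the case of untagged words alone.

```perl
use strict;
use warnings;

# Hypothetical helper: build the nested lookup hash from a list of names.
sub build_lookup {
    my @names = @_;
    my %lookup;

    for my $name (@names) {
        next unless $name =~ /\S/;    # ignore blank entries
        my $node = \%lookup;
        $node = $node->{lc($_)} ||= {} for split ' ', $name;
        $node->{_name_} = $name;      # original spelling, stored at the last word
    }

    return \%lookup;
}

# Hypothetical helper: wrap each known name in ** markers, preferring the
# longest complete name that starts at the current word.
sub tag_sentence {
    my ($lookup, $sentence) = @_;
    my @words = split ' ', $sentence;
    my @out;

    while (@words) {
        my $node = $lookup;
        my ($best, $bestLen) = (undef, 0);

        for my $i (0 .. $#words) {
            my $next = $node->{lc($words[$i])};
            last unless ref $next eq 'HASH';    # no deeper match
            $node = $next;
            ($best, $bestLen) = ($node->{_name_}, $i + 1)
                if exists $node->{_name_};
        }

        if (defined $best) {
            push @out, "**$best**";
            splice @words, 0, $bestLen;    # skip the words the name consumed
        }
        else {
            push @out, shift @words;       # not a name: pass the word through
        }
    }

    return join ' ', @out;
}

my $lookup = build_lookup(
    'long Mixed Case name protein',
    'Mixed Case name protein',
    'Protein',
);

print tag_sentence($lookup, 'A long mixed case name was not a Protein'), "\n";
# prints: A long mixed case name was not a **Protein**
```

Because the walk records the last complete name rather than the deepest node reached, a sentence that only starts to look like a long name falls back to whatever shorter name (if any) was already complete.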
Re^5: Tag protein names in sentences
by sinlam (Novice) on Feb 16, 2010 at 22:38 UTC
by GrandFather (Saint) on Feb 16, 2010 at 22:56 UTC