in reply to Re^5: Tag protein names in sentences
in thread Tag protein names in sentences

Replace:

    my $parent = $proteinLU{$word};
    my $wIndex = 0;

    while ($wIndex < @words && exists $parent->{$words[$wIndex]}) {
        $parent = $parent->{$words[$wIndex++]};
    }

    print "**$parent->{_name_}** " if exists $parent->{_name_};
    splice @words, 0, $wIndex;

with:

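    my $parent = $proteinLU{$word};
    my $wIndex = 0;
    my @best;    # (name, words consumed) for the longest complete match so far

    while ($wIndex < @words && exists $parent->{$words[$wIndex]}) {
        @best = ($parent->{_name_}, $wIndex) if exists $parent->{_name_};
        $parent = $parent->{$words[$wIndex++]};
    }
    # The loop records a name only before descending, so check the last node too.
    @best = ($parent->{_name_}, $wIndex) if exists $parent->{_name_};

    if (@best) {
        print "**$best[0]** ";
        splice @words, 0, $best[1];
    } else {
        print "$word ";
    }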

The original code had no fallback to the longest complete match found so far when a longer partial match existed. In the sample case 'MCM2' was being masked by 'MCM2 , FORMERLY'.
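For anyone who wants to try the longest-match fallback in isolation, here is a minimal self-contained sketch. It assumes the lookup is a nested hash with one level per word and a `_name_` key marking a complete name, as in the snippet above. The helper names (add_name, longest_match, tag_sentence) and the name 'MCM2 , FORMERLY XYZ1' are made up for illustration and are not from the thread's code.

```perl
use strict;
use warnings;

# Hypothetical helper: add a name to the nested-hash lookup, one level per word.
sub add_name {
    my ($trie, $name) = @_;
    my @parts = split ' ', lc $name;
    return unless @parts;               # ignore empty names
    my $node = $trie;
    for my $w (@parts) {
        $node = $node->{$w} ||= {};     # descend, creating levels as needed
    }
    $node->{_name_} = $name;            # mark a complete name at this node
}

# Return (name, words consumed) for the longest complete name that starts
# at the front of @$words, or an empty list if none matches.
sub longest_match {
    my ($trie, $words) = @_;
    my ($node, $i, @best) = ($trie, 0);
    while (1) {
        # Remember every complete name seen on the way down.
        @best = ($node->{_name_}, $i) if exists $node->{_name_};
        last if $i >= @$words;
        my $next = $node->{lc $words->[$i]};
        last unless ref $next;          # no deeper level for this word
        $node = $next;
        $i++;
    }
    return @best;
}

# Wrap each recognised name in ** ** and leave other words alone.
sub tag_sentence {
    my ($trie, $sentence) = @_;
    my @words = split ' ', $sentence;
    my @out;
    while (@words) {
        my ($name, $count) = longest_match($trie, \@words);
        if (defined $name) {
            push @out, '**' . $name . '**';
            splice @words, 0, $count;   # drop only the words actually matched
        } else {
            push @out, shift @words;
        }
    }
    return join ' ', @out;
}

my %trie;
add_name(\%trie, $_) for 'MCM2', 'MCM2 , FORMERLY XYZ1';   # second name is invented

print tag_sentence(\%trie, 'mutations in MCM2 , formerly unknown'), "\n";
# prints: mutations in **MCM2** , formerly unknown
print tag_sentence(\%trie, 'MCM2 , FORMERLY XYZ1 binds'), "\n";
# prints: **MCM2 , FORMERLY XYZ1** binds
```

The first sentence walks three words into the longer entry, fails on the fourth, and falls back to the one-word match, so only 'MCM2' is consumed and the comma stays in the output.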

There is no need (but no major harm) to provide different case versions of the same match string, unless you want a case-sensitive match that allows only the given variants. In that case you need to remove lc in the various places it is used.
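A small sketch of the difference, using a flat one-word table rather than the real lookup structure (the table names and the word list are assumptions for illustration only):

```perl
use strict;
use warnings;

# Case-insensitive: key on the lower-cased name, so one entry covers all variants.
my %lookup;
$lookup{lc $_} = $_ for 'MCM2';

my @tagged = map {
    exists $lookup{lc $_} ? '**' . $lookup{lc $_} . '**' : $_
} qw(MCM2 mcm2 Mcm2 MCM3);
print "@tagged\n";
# prints: **MCM2** **MCM2** **MCM2** MCM3

# Case-sensitive: no lc anywhere, so only the listed variants match.
my %exact;
$exact{$_} = 1 for qw(MCM2 Mcm2);
print exists $exact{mcm2} ? "match\n" : "no match\n";
# prints: no match
```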


True laziness is hard work