Replace:
    my $parent = $proteinLU{$word};
    my $wIndex = 0;

    while ($wIndex < @words && exists $parent->{$words[$wIndex]}) {
        $parent = $parent->{$words[$wIndex++]}
    }

    print "**$parent->{_name_}** " if exists $parent->{_name_};
    splice @words, 0, $wIndex;
with:
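    my $parent = $proteinLU{$word};
    my $wIndex = 0;
    my @best;    # (name, words consumed) for the longest complete match so far

    while ($wIndex < @words && exists $parent->{$words[$wIndex]}) {
        @best = ($parent->{_name_}, $wIndex) if exists $parent->{_name_};
        $parent = $parent->{$words[$wIndex++]};
    }
    # Also check the node where the walk stopped; the loop never tests it.
    @best = ($parent->{_name_}, $wIndex) if exists $parent->{_name_};

    if (@best) {
        print "**$best[0]** ";
        splice @words, 0, $best[1];
    } else {
        print "$word ";
    }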
The original code did not fall back to the longest complete match found so far when a longer candidate turned out to match only partially. In the sample case 'MCM2' was being masked by 'MCM2 , FORMERLY'.
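A minimal, self-contained sketch of that fallback, assuming the lookup is a hash-of-hashes keyed on lower-cased words, with the full name stored under `_name_` at every node that ends a name. The helpers `add_name` and `tag_sentence`, the `%lookup` hash and the sample names are made up for illustration and are not taken from the thread's code.

```perl
use strict;
use warnings;

my %lookup;    # first word => { next word => {...}, _name_ => full name }

# Hypothetical helper: insert one name into the lookup.
sub add_name {
    my ($name) = @_;
    my $node = \%lookup;
    $node = $node->{lc $_} ||= {} for split ' ', $name;
    $node->{_name_} = $name;
}

# Hypothetical helper: tag the longest complete name starting at each word.
sub tag_sentence {
    my @words = split ' ', shift;
    my @out;

    while (@words) {
        my $word = shift @words;
        my $node = $lookup{lc $word};
        if (!$node) { push @out, $word; next; }

        my ($best, $used);    # longest complete match seen so far
        my $i = 0;
        while (1) {
            ($best, $used) = ($node->{_name_}, $i) if exists $node->{_name_};
            last if $i >= @words || !exists $node->{lc $words[$i]};
            $node = $node->{lc $words[$i++]};
        }

        if (defined $best) {
            push @out, "**$best**";
            splice @words, 0, $used;    # consume only the words that matched
        }
        else {
            push @out, $word;           # partial match only: leave untagged
        }
    }
    return join ' ', @out;
}

add_name($_) for 'MCM2', 'MCM2 , FORMERLY KNOWN';
print tag_sentence('The MCM2 , formerly XYZ protein'), "\n";
# prints: The **MCM2** , formerly XYZ protein
```

The walk gets as far as 'formerly' before failing, but only the words belonging to the last complete name are spliced off, so ', formerly' is processed again as ordinary text.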
There is no need to provide different case versions of the same match string (although doing so does no major harm), unless you want a case-sensitive match that accepts only the given variants. In that case you need to remove lc in the various places it is used.
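As a rough illustration of why one entry per name suffices, here is a sketch that lower-cases the key both when storing and when looking up. The `%names` hash and `find_name` helper are hypothetical stand-ins, not the thread's data structure.

```perl
use strict;
use warnings;

my %names;
$names{lc $_} = $_ for 'MCM2';    # store each name once, under a lower-cased key

# Hypothetical helper: return the stored name for a word in any case, or undef.
sub find_name {
    my ($word) = @_;
    return $names{lc $word};
}

for my $word (qw(MCM2 mcm2 Mcm2 MCM3)) {
    my $name = find_name($word);
    print defined $name ? "$word -> **$name**\n" : "$word -> no match\n";
}
```

Dropping both lc calls would make the lookup case-sensitive, at which point every accepted spelling has to be listed explicitly.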
In reply to Re^6: Tag protein names in sentences
by GrandFather
in thread Tag protein names in sentences
by sinlam