Use lc to lower case the words used as keys for the lookup hash and lower case the search text. The code already finds long protein names which have a prefix that is the same as a shorter protein's name. I've added a splice to avoid finding a short protein that matches a part of the long protein name that has just been found. Oh, and I updated the print to generate the desired output.
use strict; use warnings; my $sentence = 'a long mixed case name protein is found in preference +to a mixed case name protein which is found before a short protein'; my %proteinLU; while (<DATA>) { chomp; my $protein = $_; my @parts = split; my $parent = \%proteinLU; while (@parts) { my $part = lc shift @parts; $parent = $parent->{$part} ||= {}; next if @parts; $parent->{_name_} = $protein; } } my @words = map {lc} split ' ', $sentence; while (@words) { my $word = shift @words; if (! exists $proteinLU{$word}) { print "$word "; next; } my $parent = $proteinLU{$word}; my $wIndex = 0; while ($wIndex < @words && exists $parent->{$words[$wIndex]}) { $parent = $parent->{$words[$wIndex++]} } print "**$parent->{_name_}** " if exists $parent->{_name_}; splice @words, 0, $wIndex; } __DATA__ long Mixed Case name protein Mixed Case name protein Protein
Prints:
a **long Mixed Case name protein** is found in preference to a **Mixed + Case name protein** which is found before a short **Protein**
In reply to Re^4: Tag protein names in sentences
by GrandFather
in thread Tag protein names in sentences
by sinlam
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |