
Use lc to lower-case both the words used as keys for the lookup hash and the words of the search text. The code already finds long protein names whose prefix is the same as a shorter protein's name. I've added a splice so that a short protein name is not matched again inside the long protein name that has just been found. I also updated the print to generate the desired output.

use strict;
use warnings;

my $sentence = 'a long mixed case name protein is found in preference '
    . 'to a mixed case name protein which is found before a short protein';

my %proteinLU;

while (<DATA>) {
    chomp;

    my $protein = $_;
    my @parts = split;
    my $parent = \%proteinLU;

    while (@parts) {
        my $part = lc shift @parts;

        $parent = $parent->{$part} ||= {};
        next if @parts;
        $parent->{_name_} = $protein;
    }
}

my @words = map {lc} split ' ', $sentence;

while (@words) {
    my $word = shift @words;

    if (! exists $proteinLU{$word}) {
        print "$word ";
        next;
    }

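    my $parent = $proteinLU{$word};
    my $wIndex = 0;
    my ($name, $matched) = ($parent->{_name_}, 0);

    # Walk as deep as the following words allow, remembering the longest
    # complete protein name seen along the way.
    while ($wIndex < @words && exists $parent->{$words[$wIndex]}) {
        $parent = $parent->{$words[$wIndex++]};
        ($name, $matched) = ($parent->{_name_}, $wIndex)
            if exists $parent->{_name_};
    }

    if (defined $name) {
        print "**$name** ";
        splice @words, 0, $matched;    # skip the words just tagged
    }
    else {
        print "$word ";                # only a partial name matched
    }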
}


__DATA__
long Mixed Case name protein
Mixed Case name protein
Protein

Prints:

a **long Mixed Case name protein** is found in preference to a **Mixed Case name protein** which is found before a short **Protein**
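If the shape of the lookup hash is hard to picture, the following minimal sketch may help. It builds the same kind of nested hash as the DATA loop above, with one level per lower-cased word and a _name_ key marking the end of a complete name, and then dumps it. The build_lookup helper is a hypothetical name introduced only for this illustration; it is not part of the code above.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper (not in the original post): build the nested lookup
# hash from a list of protein names, one hash level per lower-cased word.
sub build_lookup {
    my @names = @_;
    my %lookup;

    for my $name (@names) {
        my $node = \%lookup;

        # Descend one level per word, creating levels as needed.
        $node = $node->{lc $_} ||= {} for split ' ', $name;

        # Mark the end of a complete name, keeping its original case.
        $node->{_name_} = $name;
    }

    return \%lookup;
}

my $lookup = build_lookup(
    'long Mixed Case name protein',
    'Mixed Case name protein',
    'Protein',
);

# Sorted keys make the dump easier to compare between runs.
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Indent   = 1;
print Dumper($lookup);
```

Only nodes that end a complete name carry _name_, so a sentence that contains just the first few words of a long name reaches a node without that key.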

True laziness is hard work

In reply to Re^4: Tag protein names in sentences by GrandFather
in thread Tag protein names in sentences by sinlam
