Re^4: Tag protein names in sentences

Use lc to lower case the words used as keys for the lookup hash and lower case the search text. The code already finds long protein names which have a prefix that is the same as a shorter protein's name. I've added a splice to avoid finding a short protein that matches a part of the long protein name that has just been found. Oh, and I updated the print to generate the desired output.

use strict;
use warnings;

my $sentence = 'a long mixed case name protein is found in preference '
    . 'to a mixed case name protein which is found before a short protein';

my %proteinLU;

while (<DATA>) {
    chomp;

    my $protein = $_;
    my @parts = split;
    my $parent = \%proteinLU;

    while (@parts) {
        my $part = lc shift @parts;

        $parent = $parent->{$part} ||= {};
        next if @parts;
        $parent->{_name_} = $protein;
    }
}

my @words = map {lc} split ' ', $sentence;

while (@words) {
    my $word = shift @words;

    if (! exists $proteinLU{$word}) {
        print "$word ";
        next;
    }

    my $parent = $proteinLU{$word};
    my $wIndex = 0;

    while ($wIndex < @words && exists $parent->{$words[$wIndex]}) {
        $parent = $parent->{$words[$wIndex++]}
    }

    print "**$parent->{_name_}** " if exists $parent->{_name_};
    splice @words, 0, $wIndex;
}


__DATA__
long Mixed Case name protein
Mixed Case name protein
Protein

Prints:

a **long Mixed Case name protein** is found in preference to a **Mixed Case name protein** which is found before a short **Protein**
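
To make the lookup structure concrete: the while (<DATA>) loop builds a nested hash keyed by lower-cased words, and stores the original name under the `_name_` key at the node where a name ends. The following is a minimal sketch of that structure only. The helper name `build_lookup` is hypothetical (it is not in the post) and it takes a list of names instead of reading DATA.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original post): builds the same
# nested hash as the while (<DATA>) loop, but from a list of names.
sub build_lookup {
    my @names = @_;
    my %lookup;

    for my $name (@names) {
        my $node = \%lookup;

        # Descend one level per lower-cased word, creating nodes as needed
        $node = $node->{lc($_)} ||= {} for split ' ', $name;

        # Mark the node where a complete name ends
        $node->{_name_} = $name;
    }

    return \%lookup;
}

my $lookup = build_lookup(
    'long Mixed Case name protein',
    'Mixed Case name protein',
    'Protein',
);

# Top-level keys are the lower-cased first words of the names
print join(', ', sort keys %$lookup), "\n";    # long, mixed, protein

# Walking the words of a name leads to the node holding the original form
print $lookup->{mixed}{case}{name}{protein}{_name_}, "\n";    # Mixed Case name protein
```

Because a shorter name and a longer name that starts with it share the same path, a node can hold both a `_name_` entry and further child words.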

True laziness is hard work


Re^5: Tag protein names in sentences by sinlam (Novice) on Feb 16, 2010 at 22:38 UTC
Hi, I noticed a problem: the script does not tag a short protein name, and some words in the sentence are missing from the output. I don't know what is wrong in the script. Below are the protein names that contain the word "MCM2" and the test sentence that shows the problem. Is the problem related to redundancy in the protein names, which appear with different case? Thanks for any information!

use strict;
use warnings;

my $sentence = 'Protein kinase A-anchoring protein AKAP95 interacts '
    . 'with MCM2 , a regulator of DNA replication';

my %proteinLU;

while (<DATA>) {
    chomp;

    my $protein = $_;
    my @parts = split;
    my $parent = \%proteinLU;

    while (@parts) {
        my $part = lc shift @parts;

        $parent = $parent->{$part} ||= {};
        next if @parts;
        $parent->{_name_} = $protein;
    }
}

my @words = map {lc} split ' ', $sentence;

while (@words) {
    my $word = shift @words;

    if (! exists $proteinLU{$word}) {
        print "$word ";
        next;
    }

    my $parent = $proteinLU{$word};
    my $wIndex = 0;

    while ($wIndex < @words && exists $parent->{$words[$wIndex]}) {
        $parent = $parent->{$words[$wIndex++]}
    }

    print "$parent->{_name_} " if exists $parent->{_name_};
    splice @words, 0, $wIndex;
}

__DATA__
replication licensing factor MCM2/3/5-type protein
replication licensing factor MCM2 ALT_NAMES:MCM2p
replication licensing factor MCM2 ALT_NAMES:CDCL1
minichromosome maintenance protein MCM2 homolog
minichromosome maintenance protein MCM2
DNA replication licensing factor Mcm2
DNA replication licensing factor mcm2
DNA REPLICATION LICENSING FACTOR MCM2
mcm2 DNA replication licensing factor
Mcm2 DNA replication licensing factor
DNA replication licensing factor MCM2
contains MCM2/3/5 family signature
Central kinetochore subunit MCM22
replication licensing factor MCM2
Central kinetochore subunit MCM21
Similarity to yeast mcm2 protein
similarity to yeast mcm2 protein
MCM2/3/5 family , most YBR1441
novel MCM2/3/5 family member
NOVEL MCM2/3/5 FAMILY MEMBER
budding yeast MCM2 homolog
member of MCM2/3/5 family
MCM complex subunit Mcm2
MCM2/3/5 family protein
MCM2-related protein
At5g18020/MCM23_11
At5g18030/MCM23_13
mcm2-prov protein
MCM2 , FORMERLY
MCM2/3/5 family
P1 clone:MCM23
mcm2 protein
Mcm2 protein
MCM2 protein
mcm2-prov
MCM23_15
MCM23_13
MCM23.13
MCM23.11
MCM23.15
MCM23.16
MCM23_14
MCM23_16
MCM23.14
MCM23_11
MCM23_7
Mcm2-PA
MCM23.9
MCM23.6
MCM23_4
Mcm2-RA
MCM23.7
Mcm2-XP
Mcm2-XR
MCM23.4
MCM23_6
MCM23_1
MCM23_3
MCM23.3
MCM23.1
MCM23_2
MCM23.5
MCM23_9
MCM23_5
MCM23.2
mcm22p
mcm2_1
Mcm22p
mcm2_2
Mcm21p
mcm21p
MCM2_1
huMCM2
DmMcm2
MCM2_2
DmMCM2
X.MCM2
hMCM2
dMCM2
mcm22
MCM21
mMCM2
xMCM2
MCM2p
MCM22
mcm2p
mMcm2
mcm21
MMCM2
MCM2
mcm2
Mcm2

The desired output is:

Protein kinase A-anchoring protein AKAP95 interacts with MCM2 , a regulator of DNA replication.
Re^6: Tag protein names in sentences by GrandFather (Saint) on Feb 16, 2010 at 22:56 UTC
Replace:

my $parent = $proteinLU{$word};
my $wIndex = 0;

while ($wIndex < @words && exists $parent->{$words[$wIndex]}) {
    $parent = $parent->{$words[$wIndex++]}
}

print "$parent->{_name_} " if exists $parent->{_name_};
splice @words, 0, $wIndex;

with:

my $parent = $proteinLU{$word};
my $wIndex = 0;
my @best;

while ($wIndex < @words && exists $parent->{$words[$wIndex]}) {
    # Remember the longest complete name seen so far before descending
    @best = ($parent->{_name_}, $wIndex) if exists $parent->{_name_};
    $parent = $parent->{$words[$wIndex++]};
}

# The deepest node reached may itself be a complete name
@best = ($parent->{_name_}, $wIndex) if exists $parent->{_name_};

if (@best) {
    print "$best[0] ";
    splice @words, 0, $best[1];
} else {
    print "$word ";
}

There was no fallback to the longest match found so far when a longer partial match existed, and a word that began a partial match was dropped from the output when no complete name matched. In the sample case 'MCM2' was being masked by 'MCM2 , FORMERLY'.

There is no need (but no major harm) to provide different case versions of the same match string, unless you want a case-sensitive match with only the given variants allowed (in which case you need to remove lc in the various places it is used).

True laziness is hard work
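
The longest-match-with-fallback idea can also be sketched as a self-contained sub. This is only an illustration under stated assumptions, not the code from the thread: the name `tag_sentence` is hypothetical, the lookup is built from three sample names rather than the full DATA list, and, unlike the posted script, unmatched words keep their original case because only the lookup keys are lower-cased.

```perl
use strict;
use warnings;

# Build the nested lookup hash from a few sample names (a small subset
# of the names posted above, chosen for illustration).
my %lookup;
for my $name ('MCM2', 'MCM2 , FORMERLY', 'DNA replication licensing factor') {
    my $node = \%lookup;
    $node = $node->{lc($_)} ||= {} for split ' ', $name;
    $node->{_name_} = $name;
}

# Hypothetical helper: tags the longest known name starting at each word.
# Words that only begin a partial match are emitted unchanged.
sub tag_sentence {
    my ($lookup, $sentence) = @_;
    my @words = split ' ', $sentence;
    my @out;

    while (@words) {
        my $word = shift @words;
        my $node = $lookup->{lc($word)};

        if (ref $node ne 'HASH') {
            push @out, $word;
            next;
        }

        my $index = 0;
        my @best;    # (name, number of following words consumed)

        while (1) {
            # Record a complete name at this node, then try to go deeper
            @best = ($node->{_name_}, $index) if exists $node->{_name_};
            last if $index >= @words;

            my $next = $node->{lc($words[$index])};
            last if ref $next ne 'HASH';

            $node = $next;
            ++$index;
        }

        if (@best) {
            push @out, "**$best[0]**";
            splice @words, 0, $best[1];
        }
        else {
            push @out, $word;    # partial match only: keep the word
        }
    }

    return join ' ', @out;
}

print tag_sentence(\%lookup,
    'Protein kinase A-anchoring protein AKAP95 interacts with MCM2 , a regulator of DNA replication'),
    "\n";
# Protein kinase A-anchoring protein AKAP95 interacts with **MCM2** , a regulator of DNA replication
```

Here 'MCM2 ,' starts the longer name 'MCM2 , FORMERLY' but does not complete it, so the match falls back to 'MCM2'. 'DNA replication' starts a longer name that never completes, so both words pass through untagged instead of being dropped.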