Hi, I've noticed a problem: the script does not tag a short protein name, and some words from the sentence are missing in the output. I can't see what is wrong in the script. Below are the protein names that contain the word "MCM2" and the test sentence that shows the problem. Is the problem related to redundant protein names that differ only in case? Thanks for any information!
use strict;
use warnings;

my $sentence = 'Protein kinase A-anchoring protein AKAP95 interacts with MCM2 , a regulator of DNA replication';

my %proteinLU;

while (<DATA>) {
    chomp;
    my $protein = $_;
    my @parts = split;
    my $parent = \%proteinLU;

    while (@parts) {
        my $part = lc shift @parts;
        $parent = $parent->{$part} ||= {};
        next if @parts;
        $parent->{_name_} = $protein;
    }
}

my @words = map {lc} split ' ', $sentence;

while (@words) {
    my $word = shift @words;
    if (! exists $proteinLU{$word}) {
        print "$word ";
        next;
    }

    my $parent = $proteinLU{$word};
    my $wIndex = 0;

    while ($wIndex < @words && exists $parent->{$words[$wIndex]}) {
        $parent = $parent->{$words[$wIndex++]}
    }

    print "**$parent->{_name_}** " if exists $parent->{_name_};
    splice @words, 0, $wIndex;
}

__DATA__
replication licensing factor MCM2/3/5-type protein
replication licensing factor MCM2 ALT_NAMES:MCM2p
replication licensing factor MCM2 ALT_NAMES:CDCL1
minichromosome maintenance protein MCM2 homolog
minichromosome maintenance protein MCM2
DNA replication licensing factor Mcm2
DNA replication licensing factor mcm2
DNA REPLICATION LICENSING FACTOR MCM2
mcm2 DNA replication licensing factor
Mcm2 DNA replication licensing factor
DNA replication licensing factor MCM2
contains MCM2/3/5 family signature
Central kinetochore subunit MCM22
replication licensing factor MCM2
Central kinetochore subunit MCM21
Similarity to yeast mcm2 protein
similarity to yeast mcm2 protein
MCM2/3/5 family , most YBR1441
novel MCM2/3/5 family member
NOVEL MCM2/3/5 FAMILY MEMBER
budding yeast MCM2 homolog
member of MCM2/3/5 family
MCM complex subunit Mcm2
MCM2/3/5 family protein
MCM2-related protein
At5g18020/MCM23_11
At5g18030/MCM23_13
mcm2-prov protein
MCM2 , FORMERLY
MCM2/3/5 family
P1 clone:MCM23
mcm2 protein
Mcm2 protein
MCM2 protein
mcm2-prov
MCM23_15
MCM23_13
MCM23.13
MCM23.11
MCM23.15
MCM23.16
MCM23_14
MCM23_16
MCM23.14
MCM23_11
MCM23_7
Mcm2-PA
MCM23.9
MCM23.6
MCM23_4
Mcm2-RA
MCM23.7
Mcm2-XP
Mcm2-XR
MCM23.4
MCM23_6
MCM23_1
MCM23_3
MCM23.3
MCM23.1
MCM23_2
MCM23.5
MCM23_9
MCM23_5
MCM23.2
mcm22p
mcm2_1
Mcm22p
mcm2_2
Mcm21p
mcm21p
MCM2_1
huMCM2
DmMcm2
MCM2_2
DmMCM2
X.MCM2
hMCM2
dMCM2
mcm22
MCM21
mMCM2
xMCM2
MCM2p
MCM22
mcm2p
mMcm2
mcm21
MMCM2
MCM2
mcm2
Mcm2
The desired output is:
Protein kinase A-anchoring protein AKAP95 interacts with **MCM2** , a regulator of DNA replication.
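For what it's worth, reading the script as posted, the missing words do not seem to come from the case variants (those just overwrite each other's _name_ entry, since every key is lower-cased). The inner loop keeps walking the lookup as long as the next word matches, and only checks _name_ at the node where it stops. Because the list holds a name beginning "MCM2 ,", the walk goes from "mcm2" to "," and stops on a node with no _name_, so nothing is printed and the splice still throws the comma away. "DNA replication" is lost the same way, as an unfinished prefix of "DNA replication licensing factor ...". Below is a minimal sketch of one possible fix, not a definitive one: remember the longest complete name seen during the walk and fall back to printing the plain word. The sub names build_lookup and tag_sentence are made up for this illustration, the name list is a small subset of the data, and the tag is built from the sentence's own words so the original case is kept.

```perl
use strict;
use warnings;

# Build a nested-hash lookup keyed on lower-cased words.
# The key '_name_' marks a node where a complete protein name ends
# (assumption: no real word in the data or sentence is '_name_').
sub build_lookup {
    my @names = @_;
    my %lookup;
    for my $name (@names) {
        my $node = \%lookup;
        for my $part (split ' ', $name) {
            $node = $node->{lc $part} ||= {};
        }
        $node->{_name_} = $name;
    }
    return \%lookup;
}

# Tag the longest complete name starting at each position.
# Words that only begin a name are printed untouched instead of dropped.
sub tag_sentence {
    my ($lookup, $sentence) = @_;
    my @words = split ' ', $sentence;    # original case kept for output
    my @out;
    my $i = 0;
    while ($i < @words) {
        my $node     = $lookup;
        my $j        = $i;
        my $best_end = -1;               # index just past the longest full name
        while ($j < @words && exists $node->{lc $words[$j]}) {
            $node = $node->{lc $words[$j]};
            $j++;
            $best_end = $j if exists $node->{_name_};
        }
        if ($best_end > $i) {
            push @out, '**' . join(' ', @words[$i .. $best_end - 1]) . '**';
            $i = $best_end;
        }
        else {
            push @out, $words[$i];
            $i++;
        }
    }
    return join ' ', @out;
}

# Small subset of the names from the post, enough to trigger both problems.
my $lookup = build_lookup(
    'MCM2',
    'MCM2 , FORMERLY',
    'DNA replication licensing factor MCM2',
);
print tag_sentence($lookup,
    'Protein kinase A-anchoring protein AKAP95 interacts with MCM2 , a regulator of DNA replication'),
    "\n";
# prints: Protein kinase A-anchoring protein AKAP95 interacts with **MCM2** , a regulator of DNA replication
```

If you would rather print the stored name than the text from the sentence, keep in mind that the case variants share one node, so you get whichever variant was read last.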

In reply to Re^5: Tag protein names in sentences by sinlam
in thread Tag protein names in sentences by sinlam
