pdahal has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, I am seeking help. I have a list of drugs that are related to thrombosis, and I need to list the proteins that are targeted by those drugs. So far I have extracted the PubMed IDs that refer to those drugs. I also have a file containing the proteins that are related to blood coagulation. If any protein in the file is mentioned in an abstract, I list it. The problem is that some words in the abstracts match a protein name but do not actually refer to the protein; they are used in some other context. For example, "Xa" is a protein (Factor Xa), but sometimes "Xa" appears in an abstract in another context. I need to eliminate such matches, and I am not sure how to do this. Any help is appreciated.

Replies are listed 'Best First'.
Re: Recognizing protein name
by erix (Prior) on May 20, 2017 at 15:07 UTC

    I'd first make a large list of wanted protein names. UniProt is a good resource. It lets you filter by organism (I assumed Homo sapiens, but maybe you want to include other model organisms).

    On uniprot.org go directly to 'advanced' and select whatever you think pertains to your case.

    It's easy to use the webform, using queries like:

    taxonomy:"Homo sapiens (Human) [9606]" AND ( thrombosis OR coagulation )

    (that yields ~500 records)

    You can filter by any term, and/or using Keywords, GO terms, domain names, etc.

    The 'Download' button then contains the URL to download (as FASTA, text, or whatever). You can also specify the columns needed (when viewing in the browser or downloading the selection as CSV).

    Easy to use interactively and also not hard to build these urls yourself.
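
    As a rough illustration of building such a URL by hand, here is a minimal Perl sketch. The endpoint (rest.uniprot.org/uniprotkb/search), the parameter names (query, format, fields) and the field names are assumptions about the current UniProt REST service, not something stated in this thread, and the query string is copied from the web form example above, whose syntax the current service may no longer accept. Check the UniProt documentation before relying on any of it. The helper names url_escape and uniprot_url are hypothetical.

```perl
use strict;
use warnings;

# Percent-encode every character outside the RFC 3986 "unreserved" set.
# Intended for plain ASCII queries like the one in this thread.
sub url_escape {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf '%%%02X', ord $1/ge;
    return $text;
}

# ASSUMPTION: base URL, parameter names and default format are guesses
# about the UniProt REST service; adjust them to the documented values.
sub uniprot_url {
    my (%arg) = @_;
    my $base   = $arg{base}   // 'https://rest.uniprot.org/uniprotkb/search';
    my $format = $arg{format} // 'tsv';
    my @pairs  = ( [ query => $arg{query} ], [ format => $format ] );
    my @fields = @{ $arg{fields} // [] };
    push @pairs, [ fields => join(',', @fields) ] if @fields;

    my @parts;
    for my $pair (@pairs) {
        push @parts, $pair->[0] . '=' . url_escape($pair->[1]);
    }
    return $base . '?' . join('&', @parts);
}

# Query text taken from the post above; the field names are hypothetical.
my $url = uniprot_url(
    query  => 'taxonomy:"Homo sapiens (Human) [9606]" AND ( thrombosis OR coagulation )',
    fields => [qw(accession protein_name gene_names)],
);
print "$url\n";
```

    The sketch only builds and prints the URL; fetching it (for example with LWP or curl) is left out on purpose.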

    Once you have limited your list of protein names in this way, you can also use them to filter down the PubMed records.

    Possibly useful too: MeSH and Entrez E-Utilities (there is a Perl variant)

Re: Recognizing protein name -- Crystal::Ball
by Discipulus (Canon) on May 20, 2017 at 12:38 UTC
    hello pdahal,

    unfortunately the Crystal::Ball module is not on CPAN..

    It would be a useful module: for me, to know what you are parsing, and for you, to find proteins that are good against thrombosis in your texts. But..

    If I understand correctly, you want to tell the protein name Xa apart from Liu Xa, a made-up Chinese name.

    This is not possible..

    I suspect some manual work is needed in your case.

    The best I can imagine is the following: given a list of proteins, build a hash with the protein name as key and an empty array as value. Parse your texts searching for protein names and store some surrounding text in the array, like $prot{Xa}=["the protein Xa is good for"]

    It can happen that you store something like $prot{Xa}=["as the famous Liu Xa said to the queen"]

    Then choose some human-readable data format to save your data and clean it by hand: many options are available, YAML among them.

    It is still risky, because you can parse something like "the protein Xa used in the therapy" followed by the text "is now considered harmful".

    So, because you are working on sensitive things, the best approach is to also store the name of the text and the line number, so you can check it by hand: like $prot{Xa}=["Necronomicon,42,the protein Xa is good for"]
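
    The hash-of-arrays idea above could be sketched roughly as follows. This is only an illustration under assumptions: the function name collect_contexts, the context width and the sample text are made up, and matching is case-sensitive on whole words.

```perl
use strict;
use warnings;

# For each protein name, collect "source,line number,surrounding text"
# entries so that every hit can be checked by hand later.
sub collect_contexts {
    my ($names, $source, $text, $width) = @_;
    $width //= 30;                      # characters kept on each side of a hit
    my %prot = map { $_ => [] } @$names;

    my $line_no = 0;
    for my $line (split /\n/, $text) {
        $line_no++;
        for my $name (@$names) {
            while ($line =~ /\b\Q$name\E\b/g) {
                my $start = $-[0] - $width;
                $start = 0 if $start < 0;
                my $context = substr $line, $start, ($+[0] - $start) + $width;
                push @{ $prot{$name} }, "$source,$line_no,$context";
            }
        }
    }
    return \%prot;
}

# Hypothetical two-line "abstract" based on the examples in this post.
my $abstract = "The protein Xa is good for clotting.\n"
             . "As the famous Liu Xa said to the queen.";
my $hits = collect_contexts(['Xa', 'thrombin'], 'Necronomicon', $abstract, 12);
print "$_\n" for @{ $hits->{Xa} };
```

    Both the wanted hit and the "Liu Xa" hit end up in the list, which is the point: the structure makes manual cleaning easier, it does not decide anything by itself.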

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
      hello Discipulus, I have attempted something. I read a few abstracts and listed the words that can appear around a protein name. For example, if "Xa" is a protein name, it may appear in forms like "xa inhibitor", "xa activity" and so on. I have listed those words. Can you suggest what to do next?
        One example is not enough
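
    One way the context-word list mentioned above might be used is to accept a name only when a listed word stands directly before or after it. The sketch below assumes a tiny, hypothetical word list and a hypothetical function name protein_mentions; a real list would have to come from many more abstracts.

```perl
use strict;
use warnings;

# ASSUMPTION: hand-collected context words; these four are placeholders.
my @context_words = qw(inhibitor inhibitors activity factor);

# Return the phrases in which $name has a context word directly before
# or after it (case-insensitive, whole words only).
sub protein_mentions {
    my ($name, $text, $words) = @_;
    my $alt = join '|', map { quotemeta } @$words;
    my $re  = qr/\b((?:$alt)\s+\Q$name\E|\Q$name\E\s+(?:$alt))\b/i;
    my @found = $text =~ /$re/g;        # list context: every capture of group 1
    return @found;
}

my $text = 'Rivaroxaban is a direct factor Xa inhibitor. '
         . 'Liu Xa said hello. Anti-xa activity was measured.';
print "$_\n" for protein_mentions('Xa', $text, \@context_words);
```

    Mentions without any listed word nearby are silently dropped, so the result is only as good as the word list.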
Re: Recognizing protein name
by shmem (Chancellor) on May 20, 2017 at 14:31 UTC

    You could create a blacklist of longer unwanted sequences comprising the term, match those first, and skip them in your matching algorithm. Lather, rinse, repeat.
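
    A minimal sketch of that blacklist idea, assuming a hypothetical function name and sample phrases: the unwanted longer sequences are blanked out first (keeping the text length unchanged), and only then is the short term searched for.

```perl
use strict;
use warnings;

# Return the offsets in $text where $name occurs outside every
# blacklisted phrase. Phrases are overwritten with spaces so that the
# offsets still refer to the original text.
sub matches_outside_blacklist {
    my ($name, $text, $blacklist) = @_;
    my $cleaned = $text;
    if (@$blacklist) {
        # Longest phrases first, so a short entry cannot pre-empt a longer one.
        my $skip = join '|', map { quotemeta }
                   sort { length($b) <=> length($a) } @$blacklist;
        $cleaned =~ s/\b($skip)\b/' ' x length($1)/ge;
    }
    my @offsets;
    push @offsets, $-[0] while $cleaned =~ /\b\Q$name\E\b/g;
    return @offsets;
}

my $text = 'Liu Xa said hello; factor Xa was inhibited.';
my @hits = matches_outside_blacklist('Xa', $text, ['Liu Xa']);
print "kept ", scalar(@hits), " of the Xa matches\n";
```

    Each pass over the results tends to reveal new unwanted phrases to add to the list, which is the "lather, rinse, repeat" part.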

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
Re: Recognizing protein name
by Marshall (Canon) on May 20, 2017 at 16:37 UTC
    Ok, you have some PubMed abstracts. You want to filter those documents by some criteria. Because you have not provided any simple sample data or your desired output, I, like others here, have no idea how to even begin to help you.

    Create the simplest example that you can, showing the input and the desired output.
    If you can show some Perl code, I am sure the Monks can help quite a bit on the refinement of that code.

Re: Recognizing protein name
by thanos1983 (Parson) on May 20, 2017 at 15:33 UTC

    Hello pdahal,

    To be honest, I have never used Perl in this area, so I cannot really say that I know how to assist you. I did some very brief research, though, and found this question (translating multiple DNA sequence to protein sequence).

    Although that question may not be what you are looking for, a fellow monk (polypompholyx) proposes using BioPerl, which I think is what you should work with and which could possibly resolve many problems in the future.

    Here is also a nice tutorial (Perl and Bioinformatics).

    Hope this helps.

    Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: Recognizing protein name
by Anonymous Monk on May 20, 2017 at 12:29 UTC
    More info please - show us code, sample input, and lots of examples of desired and undesired matches