Recognizing protein name

pdahal has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Recognizing protein name by erix (Prior) on May 20, 2017 at 15:07 UTC
I'd first make a large list of wanted protein names. UniProt is a good resource. It lets you filter by organism (I assumed Homo sapiens, but maybe you want to include other model organisms). On uniprot.org, go directly to 'advanced' and select whatever you think pertains to your case. The web form is easy to use, with queries like: `taxonomy:"Homo sapiens (Human) [9606]" AND ( thrombosis OR coagulation )` (that yields ~500 records). You can filter by any term, and/or by keywords, GO terms, domain names, etc. The 'Download' button then gives the URL to download the selection (as FASTA, text, or whatever). You can also specify the columns you need (when viewing in the browser or downloading the selection as CSV). It is easy to use interactively, and it is also not hard to build these URLs yourself. Once you have limited your list of protein names in this way, you can also use them to filter down the PubMed records. Possibly useful too: MeSH and Entrez E-Utilities (there is a Perl variant).
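Building such a URL by hand could look roughly like the sketch below. The endpoint `https://rest.uniprot.org/uniprotkb/search`, the `query`/`format`/`fields` parameters, the field names, and the `organism_id:9606` query syntax are assumptions based on the current UniProt REST interface, not on the post above; check them against the UniProt documentation before relying on them. The helpers `uri_encode` and `uniprot_url` are hypothetical names, and the script only prints the URL, it does not fetch anything.

```perl
use strict;
use warnings;

# Percent-encode everything except unreserved URL characters.
# Core Perl only; assumes plain ASCII input.
sub uri_encode {
    my ($s) = @_;
    $s =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $s;
}

# Hypothetical helper: assemble a UniProt search URL.
# The base URL and parameter names are assumptions (see the note above).
sub uniprot_url {
    my (%arg) = @_;
    my $base  = 'https://rest.uniprot.org/uniprotkb/search';
    my @pairs = (
        [ query  => $arg{query} ],
        [ format => $arg{format} // 'tsv' ],
        [ fields => join(',', @{ $arg{fields} // [ 'accession', 'protein_name' ] }) ],
    );
    return $base . '?' . join('&', map { "$_->[0]=" . uri_encode($_->[1]) } @pairs);
}

# Assumed query syntax for "human proteins related to thrombosis or coagulation".
print uniprot_url(query => 'organism_id:9606 AND (thrombosis OR coagulation)'), "\n";
```

The printed URL can be pasted into a browser first to confirm that it returns the expected columns before automating the download.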
Re: Recognizing protein name -- Crystal::Ball by Discipulus (Canon) on May 20, 2017 at 12:38 UTC
Hello pdahal, unfortunately the `Crystal::Ball` module is not on CPAN. It would be a useful module: for me, to know what you are parsing, and for you, to find proteins that are good against thrombosis in your texts. But if I understand correctly, you want to tell the protein name `Xa` apart from `Liu Xa`, an invented Chinese name. This is not possible; I suspect some manual work is needed in your case. The best I can imagine is the following: given a list of proteins, build up a hash with the protein name as key and an empty array as value. Parse your texts searching for protein names and store some surrounding text in the array, like `$prot{Xa}=["the protein Xa is good for"]`. It can happen that you store something like `$prot{Xa}=["as the famous Liu Xa said to the queen"]`. Then choose some human-readable data format to save your data and clean it by hand: many options are available, YAML among them. It is still risky, because you can parse something like `the protein Xa used in the therapy` followed by the text `is now considered harmful`. So, because you are working on sensitive things, the best approach is to also store the name of the text and the line number, so you can check it by hand, like `$prot{Xa}=["Necronomicon,42,the protein Xa is good for"]`. L* There are no rules, there are no thumbs.. Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
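A minimal sketch of the hash-of-arrays idea described in this reply, assuming the texts are already in memory as strings. The helper name `collect_contexts` and the sample data are made up for illustration; matching is case-sensitive and relies on `\b`, so it assumes names that start and end with a word character.

```perl
use strict;
use warnings;

# Hypothetical helper: record every line that mentions a known protein name,
# together with the source name and line number, for later review by hand.
sub collect_contexts {
    my ($names, $source, $text) = @_;
    my %prot = map { $_ => [] } @$names;    # protein name => list of hits

    # Longest names first, so a longer name wins over a shorter one it contains.
    my $alt = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } @$names;

    my $line_no = 0;
    for my $line (split /\n/, $text) {
        $line_no++;
        while ($line =~ /\b($alt)\b/g) {
            push @{ $prot{$1} }, "$source,$line_no,$line";
        }
    }
    return \%prot;
}

my $text = join "\n",
    'the protein Xa is good for',
    'as the famous Liu Xa said to the queen';

my $hits = collect_contexts([ 'Xa' ], 'Necronomicon', $text);
print "$_\n" for @{ $hits->{Xa} };
```

Both sample lines are stored, including the false positive, which is exactly why the source name and line number are kept: the cleanup still has to be done by a human.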
Re^2: Recognizing protein name -- Crystal::Ball by pdahal (Acolyte) on May 20, 2017 at 12:53 UTC
Hello Discipulus, I have attempted to do something. I read a few abstracts and listed the words that can appear around the protein name. For example, if "Xa" is a protein name, then it may appear in forms like "xa inhibitor", "xa activity", and so on. I have listed those words. Can you suggest what to do next?
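One way to put such a word list to use is to accept a candidate name only when one of the listed words follows it. The sketch below assumes a tiny hypothetical cue list (`inhibitor`, `activity`) and a made-up helper `find_cued_mentions`; a real list would come from reading many more abstracts, and cue words that appear before the name would need a second pattern.

```perl
use strict;
use warnings;

# Hypothetical cue words taken from the two examples in the post.
my @cues = qw(inhibitor activity);

# Return every "name cue" pair found in the text, matched case-insensitively
# because abstracts may write "Xa inhibitor" or "xa inhibitor".
sub find_cued_mentions {
    my ($name, $text) = @_;
    my $cue_re = join '|', map { quotemeta($_) } @cues;
    my @found;
    while ($text =~ /\b(\Q$name\E)\s+($cue_re)\b/gi) {
        push @found, "$1 $2";
    }
    return @found;
}

my @mentions = find_cued_mentions(
    'Xa',
    'A factor Xa inhibitor reduced xa activity; Liu Xa said so.',
);
print "$_\n" for @mentions;
```

In the sample sentence, "Liu Xa said" is not reported because "said" is not a cue word, but a sentence such as "Liu Xa inhibitor" would still slip through, so this narrows the manual work rather than removing it.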
Re^3: Recognizing protein name -- Crystal::Ball by Anonymous Monk on May 20, 2017 at 13:03 UTC
One example is not enough.
Re: Recognizing protein name by shmem (Chancellor) on May 20, 2017 at 14:31 UTC
You could create a blacklist of longer unwanted sequences containing the term, match those first, and skip them in your matching algorithm. Lather, rinse, repeat. perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
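The blacklist idea can be sketched with a single alternation: the unwanted longer phrases are listed first and consumed without being counted, so the bare term is only counted where no blacklisted phrase covers it. The blacklist contents and the helper name `count_mentions` are hypothetical, and the sketch assumes the blacklist is not empty.

```perl
use strict;
use warnings;

# Hypothetical blacklist: longer phrases that contain the term but are not proteins.
my @blacklist = ('Liu Xa');

# Count occurrences of $name that are not part of a blacklisted phrase.
sub count_mentions {
    my ($name, $text) = @_;
    my $skip  = join '|', map { quotemeta($_) } @blacklist;
    my $count = 0;
    # A blacklisted phrase starts to the left of the term it contains, so the
    # regex engine matches and consumes it first; $1 is then undefined.
    while ($text =~ /\b(?:$skip)\b|\b(\Q$name\E)\b/g) {
        $count++ if defined $1;
    }
    return $count;
}

print count_mentions('Xa', 'the protein Xa is good; as Liu Xa said, Xa works'), "\n";
```

Each false positive found during manual review can be added to the blacklist and the texts reprocessed, which is the "lather, rinse, repeat" part.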
Re: Recognizing protein name by Marshall (Canon) on May 20, 2017 at 16:37 UTC
OK, you have some PubMed abstracts and you want to filter those documents by some criteria. Because you have not provided any simple sample data or your desired output, I, like others here, have no idea how to even begin to help you. Create the simplest example you can that shows the input and the desired output. If you can show some Perl code, I am sure the Monks can help quite a bit with refining it.
Re: Recognizing protein name by thanos1983 (Parson) on May 20, 2017 at 15:33 UTC
Hello pdahal, to be honest, I have never used Perl in this area, so I cannot really say that I know how to assist you. I did some quick research, though, and found this question (translating multiple DNA sequence to protein sequence). Although that question may not be exactly what you are looking for, a fellow monk (polypompholyx) proposes using BioPerl, which I think is what you should work with; it may resolve many problems in the future. Here is also a nice tutorial (Perl and Bioinformatics). Hope this helps. Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: Recognizing protein name by Anonymous Monk on May 20, 2017 at 12:29 UTC
More info, please: show us code, sample input, and lots of examples of desired and undesired matches.