stanleysj has asked for the wisdom of the Perl Monks concerning the following question:
I am a beginner in Perl and I need your help parsing a UniProt text file and creating an output file of protein terms.
My input file looks like this (this is just part of the file; the real file is available at http://www.uniprot.org/uniprot/?query=organism%3a%22Plasmodium+falciparum+%5b5833%5d%22&force=yes&format=txt):
//
ID ARF1_PLAFA Reviewed; 181 AA.
AC Q94650; O02502; O02593;
DT 15-JUL-1998, integrated into UniProtKB/Swiss-Prot.
DT 23-JAN-2007, sequence version 3.
DT 25-NOV-2008, entry version 52.
DE RecName: Full=ADP-ribosylation factor 1;
GN Name=ARF1; Synonyms=ARF, PLARF;
OS Plasmodium falciparum.
OC Eukaryota; Alveolata; Apicomplexa; Aconoidasida; Haemosporida;
OC Plasmodium; Plasmodium (Laverania).
OX NCBI_TaxID=5833;
RN [1]
RP NUCLEOTIDE SEQUENCE [GENOMIC DNA].
RC STRAIN=T9/96; TISSUE=Blood;
RX MEDLINE=97112480; PubMed=8954160;
RX DOI=10.1111/j.1432-1033.1996.0104r.x;
RA Stafford W.H., Stockley R.W., Ludbrook S.B., Holder A.A.;
RT "Isolation, expression and characterization of the gene for an ADP-
RT ribosylation factor from the human malaria parasite, Plasmodium
RT falciparum.";
RL Eur. J. Biochem. 242:104-113(1996).
RN [2]
RP NUCLEOTIDE SEQUENCE [MRNA].
RX MEDLINE=97237566; PubMed=9084044; DOI=10.1016/S0166-6851(96)02803-4;
RA Truong R.M., Francis S.E., Chakrabarti D., Goldberg D.E.;
RT "Cloning and characterization of Plasmodium falciparum ADP-
RT ribosylation factor and factor-like genes.";
RL Mol. Biochem. Parasitol. 84:247-253(1997).
CC -!- FUNCTION: GTP-binding protein that functions as an allosteric
CC activator of the cholera toxin catalytic subunit, an ADP-
CC ribosyltransferase. Involved in protein trafficking; may modulate
CC vesicle budding and uncoating within the Golgi apparatus (By
CC similarity).
CC -!- SUBCELLULAR LOCATION: Golgi apparatus (By similarity).
CC -!- SIMILARITY: Belongs to the small GTPase superfamily. Arf family.
CC -----------------------------------------------------------------------
CC Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC Distributed under the Creative Commons Attribution-NoDerivs License
CC -----------------------------------------------------------------------
DR EMBL; Z80359; CAB02498.1; -; Genomic_DNA.
DR EMBL; U57370; AAB63304.1; -; mRNA.
DR HSSP; P32889; 1RRF.
DR SMR; Q94650; 6-179.
DR GO; GO:0005794; C:Golgi apparatus; IEA:UniProtKB-KW.
DR GO; GO:0005525; F:GTP binding; IEA:InterPro.
DR GO; GO:0015031; P:protein transport; IEA:UniProtKB-KW.
DR GO; GO:0007264; P:small GTPase mediated signal transduction; IEA:InterPro.
DR GO; GO:0016192; P:vesicle-mediated transport; IEA:UniProtKB-KW.
DR InterPro; IPR006688; ARF.
DR InterPro; IPR006689; ARF/SAR.
DR InterPro; IPR001806; Ras_trnsfrmng.
DR InterPro; IPR005225; Small_GTP_bd.
DR PANTHER; PTHR11711; ARF/SAR; 1.
DR Pfam; PF00025; Arf; 1.
DR PRINTS; PR00449; RASTRNSFRMNG.
DR PRINTS; PR00328; SAR1GTPBP.
DR SMART; SM00177; ARF; 1.
DR TIGRFAMs; TIGR00231; small_GTP; 1.
DR PROSITE; PS01019; ARF; 1.
PE 2: Evidence at transcript level;
KW ER-Golgi transport; Golgi apparatus; GTP-binding; Lipoprotein;
KW Myristate; Nucleotide-binding; Protein transport; Transport.
FT INIT_MET 1 1 Removed (Potential).
FT CHAIN 2 181 ADP-ribosylation factor 1.
FT /FTId=PRO_0000207447.
FT NP_BIND 24 31 GTP (By similarity).
FT NP_BIND 67 71 GTP (By similarity).
FT NP_BIND 126 129 GTP (By similarity).
FT LIPID 2 2 N-myristoyl glycine (Potential).
SQ SEQUENCE 181 AA; 20912 MW; 18013B069BEA2413 CRC64;
MGLYVSRLFN RLFQKKDVRI LMVGLDAAGK TTILYKVKLG EVVTTIPTIG FNVETVEFRN
ISFTVWDVGG QDKIRPLWRH YYSNTDGLIF VVDSNDRERI DDAREELHRM INEEELKDAI
ILVFANKQDL PNAMSAAEVT EKLHLNTIRE RNWFIQSTCA TRGDGLYEGF DWLTTHLNNA
K
//
I am interested in the lines starting with DE and GN, and I want the text between = and ; on those lines. Each entry is terminated by a line containing //, after which a new entry begins. I want my output to look like this:
ADP-ribosylation factor 1
ARF1
ARF, PLARF
I have written a small piece of code for it. Please have a look:
while (<>) {
    @lines = grep {/^DE|^GN|^ID/} split ("\n", $_);
    foreach $lines (@lines) {
        if ($lines =~ /^DE|^GN/ && $lines !~ /Putative uncharacterized protein/) {
            $lines =~ /.+\=(.+)\;/;
            print lc($1)."\n";
        }
        elsif ($lines =~ /^ID/) {
            print " \n";
        }
    }
}
My problem is that my code does not grab all of the text between = and ; on the lines starting with GN, in particular the value after Name=.
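For what it is worth, the likely cause is that `.+\=` is greedy: it runs to the end of the line and backtracks to the last `=`, so on a GN line with both Name= and Synonyms= only the synonyms end up in $1. Below is a minimal sketch of one way around that, using a global match with a negated character class. The helper name `values_between` is made up for illustration, and it assumes every wanted value is terminated by a `;` on the same line.

```perl
use strict;
use warnings;

# Hypothetical helper: return every value found between '=' and ';' on a line.
sub values_between {
    my ($line) = @_;
    # [^;]+ cannot run past the next ';', and /g in list context
    # returns the capture from every match on the line.
    my @values = $line =~ /=\s*([^;]+);/g;
    return @values;
}

my $gn = 'GN Name=ARF1; Synonyms=ARF, PLARF;';
print "$_\n" for values_between($gn);
```

On the GN line from the sample entry this prints ARF1 and then ARF, PLARF, each on its own line.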
The next thing I want is to avoid duplicates in my output file. I have tried many of the commonly used snippets from other posts here, but they did not work for me.
My final output file should look like this:
101 kda malaria antigen p101 acidic basic repeat antigen pfl1385c actin-1 actin i pfl2215w actin-2 actin ii pf14_0124 fructose-bisphosphate aldolase 4.1.2.13 pf14_0425 acidic leucine-rich nuclear phosphoprotein 32-related protein anp32/acidic nuclear phosphoprotein-like protein pf14_0257
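A rough sketch of how the whole job might be put together, under some assumptions: entries end with // on a line of its own (so `$/` can be set to "//\n"), every wanted value sits between = and ; on a DE or GN line, and "no duplicates" means a term is printed only once across the whole file. The helper name `record_terms` is invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the lower-cased DE/GN values of one entry,
# skipping anything already recorded in the %$seen hash.
sub record_terms {
    my ($record, $seen) = @_;
    my @terms;
    for my $line (split /\n/, $record) {
        next unless $line =~ /^(?:DE|GN)\s/;
        for my $value ($line =~ /=\s*([^;]+);/g) {
            my $term = lc $value;
            # Assumption: only the placeholder value itself is skipped,
            # not the whole line it appears on.
            next if $term =~ /putative uncharacterized protein/;
            # Post-increment is false the first time a term is seen.
            push @terms, $term unless $seen->{$term}++;
        }
    }
    return @terms;
}

# Only read input when file names are given on the command line.
if (@ARGV) {
    local $/ = "//\n";    # read one UniProt entry at a time
    my %seen;             # terms already printed, across all entries
    while (my $record = <>) {
        my @terms = record_terms($record, \%seen);
        print "$_\n" for @terms;
        print "\n" if @terms;    # blank line between entries
    }
}
```

Saved as, say, terms.pl, it could be run as `perl terms.pl uniprot.txt > terms.out`. If duplicates should only be removed within a single entry, declaring %seen inside the while loop would do that instead.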
Replies are listed 'Best First'.

Re: how to parse a UniProt Flat file
by kennethk (Abbot) on Dec 09, 2008 at 19:59 UTC

Re: how to parse a UniProt Flat file
by toolic (Bishop) on Dec 09, 2008 at 20:18 UTC

Re: how to parse a UniProt Flat file
by ig (Vicar) on Dec 09, 2008 at 22:28 UTC
by stanleysj (Novice) on Dec 10, 2008 at 09:03 UTC
by stanleysj (Novice) on Dec 10, 2008 at 10:16 UTC
by kennethk (Abbot) on Dec 10, 2008 at 17:47 UTC
by ig (Vicar) on Dec 10, 2008 at 22:03 UTC

Re: how to parse a UniProt Flat file
by ig (Vicar) on Dec 11, 2008 at 00:23 UTC
by stanleysj (Novice) on Dec 30, 2008 at 10:11 UTC