how to parse a UniProt Flat file

stanleysj has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys,

I am a beginner in Perl and I need your help parsing a UniProt text file and creating an output file of protein terms.

My input file looks like this (this is just part of the file; the full file is available at http://www.uniprot.org/uniprot/?query=organism%3a%22Plasmodium+falciparum+%5b5833%5d%22&force=yes&format=txt):

//
ID   ARF1_PLAFA              Reviewed;         181 AA.
AC   Q94650; O02502; O02593;
DT   15-JUL-1998, integrated into UniProtKB/Swiss-Prot.
DT   23-JAN-2007, sequence version 3.
DT   25-NOV-2008, entry version 52.
DE   RecName: Full=ADP-ribosylation factor 1;
GN   Name=ARF1; Synonyms=ARF, PLARF;
OS   Plasmodium falciparum.
OC   Eukaryota; Alveolata; Apicomplexa; Aconoidasida; Haemosporida;
OC   Plasmodium; Plasmodium (Laverania).
OX   NCBI_TaxID=5833;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA].
RC   STRAIN=T9/96; TISSUE=Blood;
RX   MEDLINE=97112480; PubMed=8954160;
RX   DOI=10.1111/j.1432-1033.1996.0104r.x;
RA   Stafford W.H., Stockley R.W., Ludbrook S.B., Holder A.A.;
RT   "Isolation, expression and characterization of the gene for an ADP-
RT   ribosylation factor from the human malaria parasite, Plasmodium
RT   falciparum.";
RL   Eur. J. Biochem. 242:104-113(1996).
RN   [2]
RP   NUCLEOTIDE SEQUENCE [MRNA].
RX   MEDLINE=97237566; PubMed=9084044; DOI=10.1016/S0166-6851(96)02803-4;
RA   Truong R.M., Francis S.E., Chakrabarti D., Goldberg D.E.;
RT   "Cloning and characterization of Plasmodium falciparum ADP-
RT   ribosylation factor and factor-like genes.";
RL   Mol. Biochem. Parasitol. 84:247-253(1997).
CC   -!- FUNCTION: GTP-binding protein that functions as an allosteric
CC       activator of the cholera toxin catalytic subunit, an ADP-
CC       ribosyltransferase. Involved in protein trafficking; may modulate
CC       vesicle budding and uncoating within the Golgi apparatus (By
CC       similarity).
CC   -!- SUBCELLULAR LOCATION: Golgi apparatus (By similarity).
CC   -!- SIMILARITY: Belongs to the small GTPase superfamily. Arf family.
CC   -----------------------------------------------------------------------
CC   Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC   Distributed under the Creative Commons Attribution-NoDerivs License
CC   -----------------------------------------------------------------------
DR   EMBL; Z80359; CAB02498.1; -; Genomic_DNA.
DR   EMBL; U57370; AAB63304.1; -; mRNA.
DR   HSSP; P32889; 1RRF.
DR   SMR; Q94650; 6-179.
DR   GO; GO:0005794; C:Golgi apparatus; IEA:UniProtKB-KW.
DR   GO; GO:0005525; F:GTP binding; IEA:InterPro.
DR   GO; GO:0015031; P:protein transport; IEA:UniProtKB-KW.
DR   GO; GO:0007264; P:small GTPase mediated signal transduction; IEA:InterPro.
DR   GO; GO:0016192; P:vesicle-mediated transport; IEA:UniProtKB-KW.
DR   InterPro; IPR006688; ARF.
DR   InterPro; IPR006689; ARF/SAR.
DR   InterPro; IPR001806; Ras_trnsfrmng.
DR   InterPro; IPR005225; Small_GTP_bd.
DR   PANTHER; PTHR11711; ARF/SAR; 1.
DR   Pfam; PF00025; Arf; 1.
DR   PRINTS; PR00449; RASTRNSFRMNG.
DR   PRINTS; PR00328; SAR1GTPBP.
DR   SMART; SM00177; ARF; 1.
DR   TIGRFAMs; TIGR00231; small_GTP; 1.
DR   PROSITE; PS01019; ARF; 1.
PE   2: Evidence at transcript level;
KW   ER-Golgi transport; Golgi apparatus; GTP-binding; Lipoprotein;
KW   Myristate; Nucleotide-binding; Protein transport; Transport.
FT   INIT_MET      1      1       Removed (Potential).
FT   CHAIN         2    181       ADP-ribosylation factor 1.
FT                                /FTId=PRO_0000207447.
FT   NP_BIND      24     31       GTP (By similarity).
FT   NP_BIND      67     71       GTP (By similarity).
FT   NP_BIND     126    129       GTP (By similarity).
FT   LIPID         2      2       N-myristoyl glycine (Potential).
SQ   SEQUENCE   181 AA;  20912 MW;  18013B069BEA2413 CRC64;
     MGLYVSRLFN RLFQKKDVRI LMVGLDAAGK TTILYKVKLG EVVTTIPTIG FNVETVEFRN
     ISFTVWDVGG QDKIRPLWRH YYSNTDGLIF VVDSNDRERI DDAREELHRM INEEELKDAI
     ILVFANKQDL PNAMSAAEVT EKLHLNTIRE RNWFIQSTCA TRGDGLYEGF DWLTTHLNNA
     K
//

I am interested in the lines starting with DE and GN, and I want the text between = and ; on those lines. Entries are separated by //; after each separator, the details of a new entry begin. I want my output to look like this:

ADP-ribosylation factor 1
ARF1
ARF, PLARF

I have written a small piece of code for it. Please have a look:

while (<>) {
    @lines = grep {/^DE|^GN|^ID/} split ("\n", $_);
    foreach $lines (@lines) {
        if ($lines =~ /^DE|^GN/ && $lines !~ /Putative uncharacterized protein/) {
            $lines =~ /.+\=(.+)\;/;
            print lc($1)."\n";
        } elsif ($lines =~ /^ID/) {
            print " \n";
        }
    }
}

My problem is that my code does not grab all the text between = and ; in the line starting with GN, especially the value after Name=.

The next thing I want is to avoid duplicates in my output file. I have tried many of the commonly used snippets mentioned in other posts here, but they did not work for me.

My final output file should look like this:
101 kda malaria antigen
p101
acidic basic repeat antigen
pfl1385c
 
actin-1
actin i
pfl2215w
 
actin-2
actin ii
pf14_0124
 
fructose-bisphosphate aldolase
4.1.2.13
pf14_0425
 
acidic leucine-rich nuclear phosphoprotein 32-related protein
anp32/acidic nuclear phosphoprotein-like protein
pf14_0257


Replies are listed 'Best First'.
Re: how to parse a UniProt Flat file by kennethk (Abbot) on Dec 09, 2008 at 19:59 UTC
Your regex uses greedy matching, so the leading `.+` swallows 'GN   Name=ARF1; Synonyms' and only the last value on the line is captured. You can make it less greedy using '+?'. However, this alone won't fix your problem, because your format requires multiple passes per line and you are only performing one. Perhaps something like this?

    while (<>) {
        @lines = grep {/^DE|^GN|^ID/} split ("\n", $_);
        foreach $lines (@lines) {
            if ($lines =~ /^DE|^GN/ && $lines !~ /Putative uncharacterized protein/) {
                # Strip one "key=value;" pair per pass; stop when none is left
                while ($lines =~ s/.+?=(.+?);//) {
                    print lc($1)."\n";
                }
            } elsif ($lines =~ /^ID/) {
                print " \n";
            }
        }
    }
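To make the greedy-versus-global point concrete, here is a minimal sketch run against the GN line from the question. The helper name `values_in_line` is made up for this illustration, and using `[^;]+` instead of `.+?` is just one possible way to stop each capture at the semicolon.

```perl
use strict;
use warnings;

# Return every value found between '=' and ';' on one line.
# values_in_line is a hypothetical helper, not part of any module.
sub values_in_line {
    my ($line) = @_;
    # [^;]+ cannot run past a semicolon, so each key=value pair yields
    # exactly one capture; /g in list context returns all of them.
    return $line =~ /=([^;]+);/g;
}

my $gn = 'GN   Name=ARF1; Synonyms=ARF, PLARF;';

# Greedy pattern from the question: the leading .+ runs to the LAST '='.
my ($greedy) = $gn =~ /.+=(.+);/;
print "greedy: $greedy\n";

# Global match: one value per key=value pair.
print "global: $_\n" for values_in_line($gn);
```

The greedy match reports only `ARF, PLARF`, while the global match also returns `ARF1`, which is the value the original code was missing.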
Re: how to parse a UniProt Flat file by toolic (Bishop) on Dec 09, 2008 at 20:18 UTC
Not very elegant, but it seems to grab what you want:

    use strict;
    use warnings;

    while (<>) {
        if (/^DE|^GN|^ID/) {
            my $lines = $_;
            if ($lines =~ /^DE|^GN/ && $lines !~ /Putative uncharacterized protein/) {
                my @pairs = $lines =~ /(.+?=.+?;)/g;
                for my $pair (@pairs) {
                    if ($pair =~ /=(.+);/) {
                        print lc($1), "\n";
                    }
                }
            } elsif ($lines =~ /^ID/) {
                print " \n";
            }
        }
    }

    __END__
    adp-ribosylation factor 1
    arf1
    arf, plarf
Re: how to parse a UniProt Flat file by ig (Vicar) on Dec 09, 2008 at 22:28 UTC
"the next thing I want is to avoid duplicates in my output file"

I don't know what duplicates you want to avoid. I guessed and came up with the following. If there are particular cases of duplicates that you are interested in, or if you need to ignore the entire entry, you will have to be more specific.

    use strict;
    use warnings;

    my %seen;

    while (<>) {
        if (/^DE|^GN/) {
            next if (/Putative uncharacterized protein/);
            foreach (/=([^;]+);/g) {
                my $lc = lc($_);
                if ( $seen{$lc}++ > 0) {
                    print "hey! we already saw $lc!!\n";
                } else {
                    print "$lc\n";
                }
            }
        } elsif (/^ID/) {
            print "\n";
        }
    }
Re^2: how to parse a UniProt Flat file by stanleysj (Novice) on Dec 10, 2008 at 09:03 UTC
Thanks a lot, ig. Your code is exactly what I wanted. Whenever there are duplicate lines, I don't want them printed. What I want is only a single entry in the output.
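For the "print each term only once" requirement, a common idiom is `grep` with a `%seen` hash. The following is only a sketch: `unique_terms` is a made-up helper name, and the sample terms are borrowed from the expected output in the question.

```perl
use strict;
use warnings;

# unique_terms is a hypothetical helper: it lower-cases each term and
# keeps only the first occurrence, preserving the input order.
sub unique_terms {
    my @terms = @_;
    my %seen;
    # $seen{$_}++ is 0 (false) the first time a term appears, so the
    # negation lets exactly that first occurrence through.
    return grep { !$seen{$_}++ } map { lc } @terms;
}

print "$_\n" for unique_terms('ARF1', 'Actin-1', 'arf1', 'PFL2215w', 'actin-1');
```

When reading a file line by line instead of collecting all terms first, the same effect needs `%seen` to be declared outside the read loop so that it persists across lines.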
Re^2: how to parse a UniProt Flat file by stanleysj (Novice) on Dec 10, 2008 at 10:16 UTC
Hello ig, I have a new problem. I need to build a hash with the following properties:

keys = the text grabbed from the first DE line, i.e. after DE RecName= ;
values = all further text grabbed from the other DE and GN lines, i.e. from DE AltName= ; and GN Name= ;

The sole purpose is to bring all duplicate entries together under one key. The keys of a hash are unique, but there can be multiple values for a key, right?
Re^3: how to parse a UniProt Flat file by kennethk (Abbot) on Dec 10, 2008 at 17:47 UTC
In a hash, each key maps to exactly one value. If you want a one-to-many mapping, you can set the value for a given key to a reference to an anonymous array, i.e.

    my %my_hash = ();
    $my_hash{key1} = [];
    $my_hash{key1}->[0] = 'value';

Obviously, you can shorten that up. For more info on arrays, hashes, etc., check out perldata and, for the fancy stuff, perldsc.
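One way the shortened form might look is to rely on autovivification and `push` straight onto the dereferenced value. This is a sketch under assumptions: `group_names` is a made-up helper, and the name pairs are taken from the expected output in the question rather than parsed from a file.

```perl
use strict;
use warnings;

# group_names is a hypothetical helper. It takes [primary, alias] pairs
# and returns a hash mapping each primary name to an array ref of aliases.
sub group_names {
    my @pairs = @_;
    my %aliases;
    for my $pair (@pairs) {
        my ($primary, $alias) = @$pair;
        # Autovivification creates the anonymous array on first use,
        # so no explicit "$aliases{$primary} = []" is needed.
        push @{ $aliases{$primary} }, $alias;
    }
    return %aliases;
}

my %aliases = group_names(
    [ 'actin-1', 'actin i'  ],
    [ 'actin-1', 'pfl2215w' ],
    [ 'actin-2', 'actin ii' ],
);

for my $primary (sort keys %aliases) {
    print "$primary => ", join(', ', @{ $aliases{$primary} }), "\n";
}
```

Each key still holds a single value (one array reference); the "many" side lives inside the array it points to.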
Re^3: how to parse a UniProt Flat file by ig (Vicar) on Dec 10, 2008 at 22:03 UTC
A practical example of nested data structures may help.

    use strict;
    use warnings;
    use Data::Dumper;

    #
    # The keys of %entries are the Descriptions and Gene Names from all the entries.
    # The values are references to anonymous arrays, with each entry in the array
    # being a hash reference returned by read_entry(). If there is more than one
    # element in the array, then there are duplicate uses of the Description or
    # Gene Name.
    #
    my %entries;

    # Populate the hash of entries
    while ( my $entry = read_entry() ) {
        foreach ( @{ $entry->{DE} }, @{ $entry->{GN} } ) {
            push( @{ $entries{$_} }, $entry);
        }
        if( @{ $entry->{DE} } == 0 and @{ $entry->{GN} } == 0 ) {
            print "No names for: " . Dumper($entry) . "\n";
        }
    }

    #
    # Report all entries for each Description or Gene Name, noting those with
    # duplicate entries associated.
    #
    foreach ( sort keys %entries ) {
        print "-------------------------\n";
        print "Duplicate " if ( @{ $entries{$_} } > 1 );
        print "Description or Gene Name: $_\n";
        foreach ( @{ $entries{$_} } ) {
            local $" = ', ';
            print <<EOF;
        ID:                $_->{ID}
        Accession Numbers: @{ $_->{AC} }
        Descriptions:      @{ $_->{DE} }
        Gene Names:        @{ $_->{GN} }
    EOF
        }
        print "\n";
    }

    exit(0);

    #
    # read_entry() returns a hash reference representing the next entry in the
    # file, or undef at end of file.
    #
    # Each entry has four keys:
    #
    # ID  The IDentifier of the entry, as a string
    #
    # AC  The ACcession numbers of the entry, as an anonymous array reference
    #     with each element of the array being one accession number
    #
    # DE  The DEscriptions of the entry, as an anonymous array reference
    #     with each element of the array being one description
    #
    # GN  The Gene Names of the entry, as an anonymous array reference
    #     with each element of the array being one gene name
    #
    sub read_entry {
        my $entry = {
            ID => 'This entry had no ID',
            AC => [],
            DE => [],
            GN => [],
        };

        # Skip ahead to the next ID line
        my $line = <>;
        $line = <> while( defined($line) and $line !~ /^ID\s/ );
        return(undef) unless(defined($line));

        while(defined($line)) {
            if($line =~ m/^\/\//) {
                last;
            } elsif($line =~ m/^ID/) {
                if ($line =~ m/^ID\s+(\S+)/) {
                    $entry->{ID} = $1;
                } else {
                    warn("malformed ID line: $line");
                }
            } elsif ($line =~ m/^AC\s+(.*)/) {
                # Collect consecutive AC lines, then split on ';'
                my $accession_numbers;
                do {
                    $accession_numbers .= $1;
                } while ( defined($line = <>) and $line =~ m/^AC\s+(.*)/ );
                $entry->{AC} = [ $accession_numbers =~ m/\s*([^;]+);/g ];
                next;
            } elsif ($line =~ m/^DE\s+(.*)/) {
                my $description;
                do {
                    $description .= $1;
                } while ( defined($line = <>) and $line =~ m/^DE\s+(.*)/ );
                $entry->{DE} = [ map { lc } $description =~ m/=([^;]+);/g ];
                next;
            } elsif ($line =~ m/^GN\s+(.*)/) {
                my $gene_names;
                do {
                    $gene_names .= $1;
                } while ( defined($line = <>) and $line =~ m/^GN\s+(.*)/ );
                $entry->{GN} = [ map { lc } $gene_names =~ m/=([^;]+);/g ];
                next;
            }
            $line = <>;
        }

        return($entry);
    }
Re: how to parse a UniProt Flat file by ig (Vicar) on Dec 11, 2008 at 00:23 UTC
If you can install Swissknife, you don't have to write your own parser for UniProt and a program similar to the following might do what you need.

    use strict;
    use warnings;
    use Data::Dumper;

    #
    # SWISS::Entry is part of Swissknife
    # Available from http://swissknife.sourceforge.net/
    # See: http://swissknife.sourceforge.net/docs/
    #
    use SWISS::Entry;

    my %entries;

    # Change the line termination string so we read an entire entry at a time
    local $/ = "\n//\n";

    # Read in all the entries and fill %entries
    while (<>) {
        my $entry = SWISS::Entry->fromText($_);

        #
        # Add this entry to %entries once for each IDentifier, DEscription
        # and Gene Name in the entry, all keys converted to lower case.
        # The hash values are pointers to anonymous arrays, so push the
        # entries onto the arrays.
        #
        foreach my $key (
            $entry->IDs->elements,
            map { $_->text } $entry->DEs->elements,
            map { ( $_->Name, $_->Synonyms ) } $entry->GNs->elements,
        ) {
            push( @{$entries{lc($key)}}, $entry);
        }
    }

    #
    # Now report on each key in %entries
    #
    foreach my $key (sort keys %entries) {
        print "\n\n----------------------\n";
        print "DUPLICATE " if ( @{$entries{$key}} > 1);
        print "key $key\n";
        foreach my $entry ( @{$entries{$key}} ) {
            print "\n";
            print "    IDs " . join(", ", $entry->IDs->elements) . "\n"
                if($entry->IDs);
            print "    DEs " . join(", ", map { $_->text } $entry->DEs->elements) . "\n"
                if($entry->DEs);
            print "    GNs " . join(", ", map { $_->text } map { ($_->Name, $_->Synonyms) } $entry->GNs->elements) . "\n"
                if($entry->GNs);
        }
    }
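If Swissknife is not an option, the same record-at-a-time reading with `$/ = "\n//\n"` can be combined with a plain regex. The sketch below is only an illustration under assumptions: `names_in_entry` is a made-up helper, the input is an in-memory sample rather than a real file, and the second entry is abbreviated and written by hand from the expected output in the question, not copied from UniProt.

```perl
use strict;
use warnings;

# names_in_entry is a hypothetical helper: given the text of one entry,
# it returns the lower-cased values from its DE and GN lines, in order.
sub names_in_entry {
    my ($text) = @_;
    my @names;
    for my $line (split /\n/, $text) {
        next unless $line =~ /^(?:DE|GN)\s/;
        push @names, map { lc } $line =~ /=([^;]+);/g;
    }
    return @names;
}

# Hand-made sample; the second entry is illustrative only.
my $sample = <<'END';
ID   ARF1_PLAFA              Reviewed;         181 AA.
DE   RecName: Full=ADP-ribosylation factor 1;
GN   Name=ARF1; Synonyms=ARF, PLARF;
//
DE   RecName: Full=Actin-1;
DE   AltName: Full=Actin I;
GN   ORFNames=PFL2215w;
//
END

open(my $fh, '<', \$sample) or die "cannot open in-memory sample: $!";
{
    # Read one whole entry per iteration instead of one line.
    local $/ = "\n//\n";
    while (my $entry = <$fh>) {
        # First name becomes the key, the rest are its other names.
        my ($primary, @others) = names_in_entry($entry);
        next unless defined $primary;
        print "$primary => ", join(' | ', @others), "\n";
    }
}
close $fh;
```

Treating the first DE value as the key and everything else as its values matches the hash layout asked for earlier in the thread; for a real file, `<$fh>` would be replaced by `<>` or a handle opened on the download.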
Re^2: how to parse a UniProt Flat file by stanleysj (Novice) on Dec 30, 2008 at 10:11 UTC
Finally, I have come up with code that does the job of parsing a UniProt file and getting the terms. I hope this code is useful to others. Thanks to all who helped me out in this node.

    # "\n//\n" rather than "//", so that the "//" in URLs does not split an entry
    $/ = "\n//\n";
    $count = 0;
    while ($chunkData = <>) {
        @data = grep {$_ !~ /^\s*$/}
                map {/.+?\=(.+?);/g}
                grep {$_ =~ /^DE.+?\=(.+?);|^GN.+?\=(.+?);/}
                split ("\n", $chunkData);
        foreach $term (@data) {
            # UniProt spells it "uncharacterized"; accept both spellings
            next if ($term =~ /Putative uncharacteri[sz]ed protein/);
            if ($term =~ m/\,/g) {
                foreach (split (/\,\s/, $term)) {
                    $hash{lc ($_)}++;
                    next if $hash{lc ($_)} > 1;
                    $count++;
                    print "$count ";
                    print lc($_)."\n";
                }
            } elsif ($term =~ /(.+?)$(.+?)$/g) {
                $hash{lc ($1)}++;
                next if $hash{lc ($1)} > 1;
                $count++;
                print "$count ";
                print lc($1)."\n";
                $hash{lc ($2)}++;
                next if $hash{lc ($2)} > 1;
                $count++;
                print "$count ";
                print lc($2)."\n";
            } else {
                $hash{lc ($term)}++;
                next if $hash{lc ($term)} > 1;
                $count++;
                print "$count ";
                print lc ($term)."\n";
            }
        }
        print "\n";
    }