stanleysj has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys,

I am a beginner in Perl and I need your help parsing a UniProt text file and creating an output file of protein terms.

My input file looks like this (this is just part of the file; the real file is available at http://www.uniprot.org/uniprot/?query=organism%3a%22Plasmodium+falciparum+%5b5833%5d%22&force=yes&format=txt):

//
ID   ARF1_PLAFA              Reviewed;         181 AA.
AC   Q94650; O02502; O02593;
DT   15-JUL-1998, integrated into UniProtKB/Swiss-Prot.
DT   23-JAN-2007, sequence version 3.
DT   25-NOV-2008, entry version 52.
DE   RecName: Full=ADP-ribosylation factor 1;
GN   Name=ARF1; Synonyms=ARF, PLARF;
OS   Plasmodium falciparum.
OC   Eukaryota; Alveolata; Apicomplexa; Aconoidasida; Haemosporida;
OC   Plasmodium; Plasmodium (Laverania).
OX   NCBI_TaxID=5833;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA].
RC   STRAIN=T9/96; TISSUE=Blood;
RX   MEDLINE=97112480; PubMed=8954160;
RX   DOI=10.1111/j.1432-1033.1996.0104r.x;
RA   Stafford W.H., Stockley R.W., Ludbrook S.B., Holder A.A.;
RT   "Isolation, expression and characterization of the gene for an ADP-
RT   ribosylation factor from the human malaria parasite, Plasmodium
RT   falciparum.";
RL   Eur. J. Biochem. 242:104-113(1996).
RN   [2]
RP   NUCLEOTIDE SEQUENCE [MRNA].
RX   MEDLINE=97237566; PubMed=9084044; DOI=10.1016/S0166-6851(96)02803-4;
RA   Truong R.M., Francis S.E., Chakrabarti D., Goldberg D.E.;
RT   "Cloning and characterization of Plasmodium falciparum ADP-
RT   ribosylation factor and factor-like genes.";
RL   Mol. Biochem. Parasitol. 84:247-253(1997).
CC   -!- FUNCTION: GTP-binding protein that functions as an allosteric
CC       activator of the cholera toxin catalytic subunit, an ADP-
CC       ribosyltransferase. Involved in protein trafficking; may modulate
CC       vesicle budding and uncoating within the Golgi apparatus (By
CC       similarity).
CC   -!- SUBCELLULAR LOCATION: Golgi apparatus (By similarity).
CC   -!- SIMILARITY: Belongs to the small GTPase superfamily. Arf family.
CC   -----------------------------------------------------------------------
CC   Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC   Distributed under the Creative Commons Attribution-NoDerivs License
CC   -----------------------------------------------------------------------
DR   EMBL; Z80359; CAB02498.1; -; Genomic_DNA.
DR   EMBL; U57370; AAB63304.1; -; mRNA.
DR   HSSP; P32889; 1RRF.
DR   SMR; Q94650; 6-179.
DR   GO; GO:0005794; C:Golgi apparatus; IEA:UniProtKB-KW.
DR   GO; GO:0005525; F:GTP binding; IEA:InterPro.
DR   GO; GO:0015031; P:protein transport; IEA:UniProtKB-KW.
DR   GO; GO:0007264; P:small GTPase mediated signal transduction; IEA:InterPro.
DR   GO; GO:0016192; P:vesicle-mediated transport; IEA:UniProtKB-KW.
DR   InterPro; IPR006688; ARF.
DR   InterPro; IPR006689; ARF/SAR.
DR   InterPro; IPR001806; Ras_trnsfrmng.
DR   InterPro; IPR005225; Small_GTP_bd.
DR   PANTHER; PTHR11711; ARF/SAR; 1.
DR   Pfam; PF00025; Arf; 1.
DR   PRINTS; PR00449; RASTRNSFRMNG.
DR   PRINTS; PR00328; SAR1GTPBP.
DR   SMART; SM00177; ARF; 1.
DR   TIGRFAMs; TIGR00231; small_GTP; 1.
DR   PROSITE; PS01019; ARF; 1.
PE   2: Evidence at transcript level;
KW   ER-Golgi transport; Golgi apparatus; GTP-binding; Lipoprotein;
KW   Myristate; Nucleotide-binding; Protein transport; Transport.
FT   INIT_MET      1      1       Removed (Potential).
FT   CHAIN         2    181       ADP-ribosylation factor 1.
FT                                /FTId=PRO_0000207447.
FT   NP_BIND      24     31       GTP (By similarity).
FT   NP_BIND      67     71       GTP (By similarity).
FT   NP_BIND     126    129       GTP (By similarity).
FT   LIPID         2      2       N-myristoyl glycine (Potential).
SQ   SEQUENCE   181 AA;  20912 MW;  18013B069BEA2413 CRC64;
     MGLYVSRLFN RLFQKKDVRI LMVGLDAAGK TTILYKVKLG EVVTTIPTIG FNVETVEFRN
     ISFTVWDVGG QDKIRPLWRH YYSNTDGLIF VVDSNDRERI DDAREELHRM INEEELKDAI
     ILVFANKQDL PNAMSAAEVT EKLHLNTIRE RNWFIQSTCA TRGDGLYEGF DWLTTHLNNA
     K
//

I am interested in the lines starting with DE and GN, and I want the text between = and ; on those lines. Entries are separated by //, after which the details of a new entry begin. I want my output to look like this:

ADP-ribosylation factor 1
ARF1
ARF, PLARF

I have written a small piece of code for it; please have a look:

while (<>) {
    @lines = grep {/^DE|^GN|^ID/} split ("\n", $_);
    foreach $lines(@lines) {
        if ($lines =~ /^DE|^GN/ && $lines !~ /Putative uncharacterized protein/) {
            $lines =~ /.+\=(.+)\;/;
            print lc($1)."\n";
        }
        elsif ($lines =~ /^ID/) {
            print " \n";
        }
    }
}

My problem is that my code does not grab all of the text between = and ; in the line starting with GN, especially the value after Name= ;

The next thing I want is to avoid duplicates in my output file. I have tried many of the commonly used snippets mentioned in other posts here, but they did not work for me.

My final output file should look like this:

101 kda malaria antigen
p101
acidic basic repeat antigen
pfl1385c
actin-1
actin i
pfl2215w
actin-2
actin ii
pf14_0124
fructose-bisphosphate aldolase
4.1.2.13
pf14_0425
acidic leucine-rich nuclear phosphoprotein 32-related protein
anp32/acidic nuclear phosphoprotein-like protein
pf14_0257

Replies are listed 'Best First'.
Re: how to parse a UniProt Flat file
by kennethk (Abbot) on Dec 09, 2008 at 19:59 UTC

    Your regex uses greedy matching, so the first '.+' swallows everything up to the last '=' (here 'GN Name=ARF1; Synonyms') and only the final value is captured. You can make it non-greedy using '+?'.

    However, this alone won't fix your problem, because your format requires multiple matches per line and you are only performing one. Perhaps something like this?

    @lines = grep {/^DE|^GN|^ID/} split ("\n", $_);
    foreach $lines (@lines) {
        if ($lines =~ /^DE|^GN/ && $lines !~ /Putative uncharacterized protein/) {
            # Strip one "key=value;" pair per pass and stop when none is left,
            # so a line without such a pair cannot loop forever.
            while ($lines =~ s/.+?\=(.+?)\;//) {
                print lc($1)."\n";
            }
        }
        elsif ($lines =~ /^ID/) {
            print " \n";
        }
    }
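The difference between the greedy pattern, the non-greedy pattern, and a global match can be seen in a minimal sketch. The sub names (first_greedy, first_lazy, all_values) are illustrative and not from the original posts; the sample line is the GN line from the entry in the question.

```perl
use strict;
use warnings;

# GN line taken from the sample entry in the question.
my $line = 'GN   Name=ARF1; Synonyms=ARF, PLARF;';

# Greedy: the leading .+ runs to the LAST "=", so only the final value is captured.
sub first_greedy { my ($s) = @_; return $s =~ /.+=(.+);/ ? $1 : undef }

# Non-greedy: the leading .+? stops at the FIRST "=", but a single match
# still yields a single value.
sub first_lazy { my ($s) = @_; return $s =~ /.+?=(.+?);/ ? $1 : undef }

# Global match in list context: returns every value between "=" and ";".
sub all_values { my ($s) = @_; return $s =~ /=([^;]+);/g }

print "greedy: ", first_greedy($line), "\n";          # ARF, PLARF
print "lazy:   ", first_lazy($line), "\n";            # ARF1
print "all:    ", join(' | ', all_values($line)), "\n"; # ARF1 | ARF, PLARF
```

The global match is the smallest change that returns every value on a line, which is why the later replies in this thread use /g rather than repeated substitutions.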
Re: how to parse a UniProt Flat file
by toolic (Bishop) on Dec 09, 2008 at 20:18 UTC
    Not very elegant, but it seems to grab what you want:
    use strict;
    use warnings;

    while (<>) {
        if (/^DE|^GN|^ID/) {
            my $lines = $_;
            if ($lines =~ /^DE|^GN/ && $lines !~ /Putative uncharacterized protein/) {
                my @pairs = $lines =~ /(.+?=.+?;)/g;
                for my $pair (@pairs) {
                    if ($pair =~ /=(.+);/) {
                        print lc($1), "\n";
                    }
                }
            }
            elsif ($lines =~ /^ID/) {
                print " \n";
            }
        }
    }

    __END__
    adp-ribosylation factor 1
    arf1
    arf, plarf
Re: how to parse a UniProt Flat file
by ig (Vicar) on Dec 09, 2008 at 22:28 UTC
    the next thing I want is to avoid duplicates in my output file

    I don't know what duplicates you want to avoid. I guessed and came up with the following. If there are particular kinds of duplicate that you are interested in, or if you need to ignore the entire entry, you will have to be more specific.

    use strict;
    use warnings;

    my %seen;

    while (<>) {
        if (/^DE|^GN/) {
            next if (/Putative uncharacterized protein/);
            # Every value between "=" and ";" on this line
            foreach (/=([^;]+);/g) {
                my $lc = lc($_);
                if ( $seen{$lc}++ > 0) {
                    print "hey! we already saw $lc!!\n";
                } else {
                    print "$lc\n";
                }
            }
        } elsif (/^ID/) {
            print "\n";
        }
    }

      Thanks a lot, ig. Your code is exactly what I wanted. Whenever there are duplicate lines, I don't want them printed. What I want is only a single entry in the output.
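Printing each term only once needs just a small change to the %seen idea: skip a term that has been seen before instead of reporting it. A minimal sketch follows; the sub name unique_terms and the sample list are made up for illustration.

```perl
use strict;
use warnings;

# Return the terms lowercased and in their original order, keeping only the
# first occurrence of each one.
sub unique_terms {
    my %seen;
    return grep { !$seen{$_}++ } map { lc($_) } @_;
}

# Made-up sample terms containing case-insensitive duplicates.
my @terms = ('Actin-1', 'Actin I', 'actin-1', 'PFL2215w', 'ACTIN I');
print "$_\n" for unique_terms(@terms);
```

Because %seen lives inside the sub, duplicates are only detected within one call; for a whole file, all terms would have to be passed in together or the hash kept outside the sub.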


      Hello ig,

      I have a new problem... I need to build a hash with the following properties:
      keys = the text grabbed from the first DE line, i.e. after
      DE RecName= ;
      values = all further text grabbed from the other DE and GN lines, i.e. from DE AltName= ; GN Name= ;

      The sole purpose is to bring together all duplicate entries under one key. The keys of a hash are unique, but there could be multiple values for a key, right?

        In a hash, there is a one-to-one correspondence between keys and values. If you want to have a one-to-many mapping, you can set the value for a given key to an anonymous array, i.e.

        %my_hash = ();
        $my_hash{key1} = [];
        $my_hash{key1}->[0] = 'value';

        Obviously, you can shorten that up. For more info on arrays, hashes, etc., check out perldata and, for the fancy stuff, perldsc.
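The one-to-many mapping described above is usually built with push on a dereferenced hash value. A minimal sketch, assuming made-up [key, value] pairs such as might be collected from DE and GN lines; the name group_names is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample data: [ RecName, other name ] pairs.
my @pairs = (
    [ 'actin-1', 'actin i'  ],
    [ 'actin-1', 'pfl2215w' ],
    [ 'actin-2', 'actin ii' ],
);

# Build a hash of arrays: each key maps to a reference to an array of values.
sub group_names {
    my %names;
    for my $pair (@_) {
        my ($key, $value) = @$pair;
        push @{ $names{$key} }, $value;   # the array is created on first use
    }
    return \%names;
}

my $names = group_names(@pairs);
for my $key (sort keys %$names) {
    print "$key: ", join(', ', @{ $names->{$key} }), "\n";
}
```

There is no need to assign an empty array reference first: dereferencing a missing key with @{ ... } in a push creates the array automatically.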

        A practical example of nested data structures may help.

        use strict;
        use warnings;
        use Data::Dumper;

        #
        # The keys of %entries are the Descriptions and Gene Names from all the entries.
        # The values are references to anonymous arrays, with each entry in the array
        # being a hash reference returned by read_entry(). If there is more than one
        # element in the array, then there are duplicate uses of the Description or
        # Gene Name.
        #
        my %entries;

        # Populate the hash of entries
        while ( my $entry = read_entry() ) {
            foreach ( @{ $entry->{DE} }, @{ $entry->{GN} } ) {
                push( @{ $entries{$_} }, $entry);
            }
            if( @{ $entry->{DE} } == 0 and @{ $entry->{GN} } == 0 ) {
                print "No names for: " . Dumper($entry) . "\n";
            }
        }

        #
        # Report all entries for each Description or Gene Name, noting those with
        # duplicate entries associated.
        #
        foreach ( sort keys %entries ) {
            print "-------------------------\n";
            print "Duplicate " if ( @{ $entries{$_} } > 1 );
            print "Description or Gene Name: $_\n";
            foreach ( @{ $entries{$_} } ) {
                local $" = ', ';
                print <<EOF;
        ID: $_->{ID}
        Accession Numbers: @{ $_->{AC} }
        Descriptions: @{ $_->{DE} }
        Gene Names: @{ $_->{GN} }
        EOF
            }
            print "\n";
        }

        exit(0);

        #
        # read_entry() returns a hash reference representing the next entry in the
        # file, or undef at end of file.
        #
        # Each entry has four keys:
        #
        # ID  The IDentifier of the entry, as a string
        #
        # AC  The ACcession numbers of the entry, as an anonymous array reference
        #     with each element of the array being one accession number
        #
        # DE  The DEscriptions of the entry, as an anonymous array reference
        #     with each element of the array being one description
        #
        # GN  The Gene Names of the entry, as an anonymous array reference
        #     with each element of the array being one gene name
        #
        sub read_entry {
            my $entry = {
                ID => 'This entry had no ID',
                AC => [],
                DE => [],
                GN => [],
            };

            # Skip ahead to the next ID line
            my $line = <>;
            $line = <> while( defined( $line) and $line !~ /^ID\s/ );
            return(undef) unless(defined($line));

            while(defined($line)) {
                if($line =~ m/^\/\//) {
                    last;
                } elsif($line =~ m/^ID/) {
                    if ($line =~ m/^ID\s+(\S+)/) {
                        $entry->{ID} = $1;
                    } else {
                        warn("malformed ID line: $line");
                    }
                } elsif ($line =~ m/^AC\s+(.*)/) {
                    # Join consecutive AC lines before splitting on ";"
                    my $accession_numbers;
                    do {
                        $accession_numbers .= $1;
                    } while ( defined($line = <>) and $line =~ m/^AC\s+(.*)/ );
                    $entry->{AC} = [ $accession_numbers =~ m/([^;]+);/g ];
                    next;
                } elsif ($line =~ m/^DE\s+(.*)/) {
                    my $description;
                    do {
                        $description .= $1;
                    } while ( defined($line = <>) and $line =~ m/^DE\s+(.*)/ );
                    $entry->{DE} = [ map { lc } $description =~ m/=([^;]+);/g ];
                    next;
                } elsif ($line =~ m/^GN\s+(.*)/) {
                    my $gene_names;
                    do {
                        $gene_names .= $1;
                    } while ( defined($line = <>) and $line =~ m/^GN\s+(.*)/ );
                    $entry->{GN} = [ map { lc } $gene_names =~ m/=([^;]+);/g ];
                    next;
                }
                $line = <>;
            }

            return($entry);
        }
Re: how to parse a UniProt Flat file
by ig (Vicar) on Dec 11, 2008 at 00:23 UTC

    If you can install Swissknife, you don't have to write your own parser for UniProt and a program similar to the following might do what you need.

    use strict;
    use warnings;
    use Data::Dumper;

    #
    # SWISS::Entry is part of Swissknife
    # Available from http://swissknife.sourceforge.net/
    # See: http://swissknife.sourceforge.net/docs/
    #
    use SWISS::Entry;

    my %entries;

    # Change the line termination string so we read an entire entry at a time
    local $/ = "\n//\n";

    # Read in all the entries and fill %entries
    while (<>) {
        my $entry = SWISS::Entry->fromText($_);

        #
        # Add this entry to %entries once for each IDentifier, DEscription
        # and Gene Name in the entry, all keys converted to lower case.
        # The hash values are pointers to anonymous arrays, so push the
        # entries onto the arrays.
        #
        foreach my $key (
            $entry->IDs->elements,
            map { $_->text } $entry->DEs->elements,
            map { ( $_->Name, $_->Synonyms ) } $entry->GNs->elements ,
        ) {
            push( @{$entries{lc($key)}}, $entry);
        }
    }

    #
    # Now report on each key in %entries
    #
    foreach my $key (sort keys %entries) {
        print "\n\n----------------------\n";
        print "DUPLICATE " if ( @{$entries{$key}} > 1);
        print "key $key\n";
        foreach my $entry ( @{$entries{$key}} ) {
            print "\n";
            print " IDs " . join(", ", $entry->IDs->elements) . "\n"
                if($entry->IDs);
            print " DEs " . join(", ", map { $_->text } $entry->DEs->elements) . "\n"
                if($entry->DEs);
            print " GNs " . join(", ", map { $_->text } map { ($_->Name, $_->Synonyms) } $entry->GNs->elements) . "\n"
                if($entry->GNs);
        }
    }
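Changing the input record separator, as the program above does, also works without Swissknife. A minimal sketch, assuming an in-memory filehandle with made-up, heavily shortened data, shows that each read then returns one whole entry; read_records is an illustrative name, not part of any library.

```perl
use strict;
use warnings;

# Read a filehandle one UniProt-style entry at a time by setting the input
# record separator to the "//" terminator line.
sub read_records {
    my ($fh) = @_;
    local $/ = "\n//\n";
    my @records;
    while (my $record = <$fh>) {
        chomp $record;              # chomp removes the current value of $/
        push @records, $record;
    }
    return @records;
}

# Made-up sample data with two entries.
my $data = "ID   A_TEST\nDE   RecName: Full=Protein A;\n//\n"
         . "ID   B_TEST\nDE   RecName: Full=Protein B;\n//\n";

open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";
my @records = read_records($fh);
close($fh);

printf "%d entries\n", scalar @records;
for my $record (@records) {
    my ($id) = $record =~ /^ID\s+(\S+)/m;
    print "$id\n";
}
```

Using "\n//\n" rather than a bare "//" keeps the separator from matching inside URLs such as the one in the CC lines of the sample entry.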

      Finally, I have come up with code that does the job of parsing a UniProt file and extracting the terms. I hope this code will be useful to others. Thanks to all who helped me out in this node.

      use strict;
      use warnings;

      # Each entry ends with a line containing only "//". A bare "//" separator
      # would also split inside URLs such as http://www.uniprot.org/terms.
      $/ = "\n//\n";

      my %hash;
      my $count = 0;

      while (my $chunkData = <>) {
          # Keep every non-empty value between "=" and ";" on DE and GN lines
          my @data = grep { $_ !~ /^\s*$/ }
                     map  { /.+?\=(.+?);/g }
                     grep { $_ =~ /^DE.+?\=(.+?);|^GN.+?\=(.+?);/ }
                     split ("\n", $chunkData);
          foreach my $term (@data) {
              next if ($term =~ /Putative uncharacterized protein/);
              if ($term =~ m/\,/) {
                  # Comma-separated synonyms: print each new one
                  foreach (split (/\,\s/, $term)) {
                      $hash{lc ($_)}++;
                      next if $hash{lc ($_)} > 1;
                      $count++;
                      print "$count ";
                      print lc($_)."\n";
                  }
              }
              elsif ($term =~ /(.+?)\((.+?)\)/) {
                  # "name (alias)": print the name, then the alias
                  $hash{lc ($1)}++;
                  next if $hash{lc ($1)} >1;
                  $count++;
                  print "$count ";
                  print lc($1)."\n";
                  $hash{lc ($2)}++;
                  next if $hash{lc ($2)} >1;
                  $count++;
                  print "$count ";
                  print lc($2)."\n";
              }
              else {
                  $hash{lc ($term)}++;
                  next if $hash{lc ($term)} > 1;
                  $count++;
                  print "$count ";
                  print lc ($term)."\n";
              }
          }
          print "\n";
      }