in reply to how to parse a UniProt Flat file

the next thing I want is to avoid duplicates in my output file

I don't know which duplicates you want to avoid, so I guessed and came up with the following. If you are interested in particular kinds of duplicate, or if you need to skip an entire entry, you will have to be more specific.

use strict;
use warnings;

my %seen;

while (<>) {
    if (/^DE|^GN/) {
        next if (/Putative uncharacterized protein/);
        foreach (/=([^;]+);/g) {
            my $lc = lc($_);
            if ( $seen{$lc}++ > 0 ) {
                print "hey! we already saw $lc!!\n";
            }
            else {
                print "$lc\n";
            }
        }
    }
    elsif (/^ID/) {
        print "\n";
    }
}
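
If the aim is only to print each name once and stay silent about repeats, the same %seen idea can be reduced to a filter. This is a minimal sketch under that assumption; the helper name unique_names and the sample lines are made up for illustration and are not taken from your file.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lower-cased names found on DE/GN lines,
# keeping only the first occurrence of each name.
sub unique_names {
    my @lines = @_;
    my %seen;
    my @names;
    foreach my $line (@lines) {
        next unless $line =~ /^(?:DE|GN)/;
        foreach my $name ( $line =~ /=([^;]+);/g ) {
            my $lc = lc $name;
            # $seen{$lc}++ is false only the first time a name turns up
            push @names, $lc unless $seen{$lc}++;
        }
    }
    return @names;
}

# Made-up sample lines in the DE/GN style discussed in this thread.
my @sample = (
    "DE   RecName: Full=Protein kinase C;",
    "DE   AltName: Full=PKC;",
    "GN   Name=PKC;",
);
print "$_\n" for unique_names(@sample);
```

With the sample above, "pkc" is printed once even though it appears on both a DE and a GN line.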

Re^2: how to parse a UniProt Flat file
by stanleysj (Novice) on Dec 10, 2008 at 09:03 UTC

    Thanks a lot, Ig. Your code is exactly what I wanted. Whenever there are duplicate lines, I don't want them printed. What I want is only a single entry in the output.

Re^2: how to parse a UniProt Flat file
by stanleysj (Novice) on Dec 10, 2008 at 10:16 UTC

    Hello IG

    I have a new problem. I have to build a hash with the following properties:
    keys = the text grabbed from the first DE line, i.e. after DE RecName= ;
    values = all further text grabbed from the other DE and GN lines, i.e. from DE AltName= ; GN Name= ;

    The sole purpose is to bring together all duplicate entries under one key. The keys of the hash are unique, but there could be multiple values for a key, right?

      In a hash, there is a one-to-one correspondence between keys and values. If you want to have a one-to-many mapping, you can set the value for a given key to an anonymous array, i.e.

      my %my_hash = ();
      $my_hash{key1} = [];
      $my_hash{key1}->[0] = 'value';

      Obviously, you can shorten that up. For more info on arrays, hashes, etc., check out perldata and, for the fancy stuff, perldsc.
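
      As a sketch of the shortened form: Perl autovivifies the anonymous array the first time you push onto it, so there is no need to assign [] beforehand. The helper name group_values and the key/value strings below are placeholders for illustration, not anything from your data.

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper: takes a list of [key, value] pairs and returns a
      # hash reference mapping each key to an array reference of its values.
      sub group_values {
          my @pairs = @_;
          my %grouped;
          foreach my $pair (@pairs) {
              my ( $key, $value ) = @$pair;
              # push creates $grouped{$key} as an array reference on first use
              push @{ $grouped{$key} }, $value;
          }
          return \%grouped;
      }

      # Placeholder data: the first element stands in for the RecName text,
      # the second for an AltName or gene name.
      my $grouped = group_values(
          [ 'rec name 1', 'alt name a' ],
          [ 'rec name 1', 'gene name x' ],
          [ 'rec name 2', 'alt name b' ],
      );

      foreach my $key ( sort keys %$grouped ) {
          print "$key => ", join( ', ', @{ $grouped->{$key} } ), "\n";
      }
      ```

      Each key still appears only once in the hash; the "many" side lives in the array that the key's value points to.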

      A practical example of nested data structures may help.

      use strict;
      use warnings;
      use Data::Dumper;

      #
      # The keys of %entries are the Descriptions and Gene Names from all the entries.
      # The values are references to anonymous arrays, with each entry in the array
      # being a hash reference returned by read_entry(). If there is more than one
      # element in the array, then there are duplicate uses of the Description or
      # Gene Name.
      #
      my %entries;

      # Populate the hash of entries
      while ( my $entry = read_entry() ) {
          foreach ( @{ $entry->{DE} }, @{ $entry->{GN} } ) {
              push( @{ $entries{$_} }, $entry );
          }
          if ( @{ $entry->{DE} } == 0 and @{ $entry->{GN} } == 0 ) {
              print "No names for: " . Dumper($entry) . "\n";
          }
      }

      #
      # Report all entries for each Description or Gene Name, noting those with
      # duplicate entries associated.
      #
      foreach ( sort keys %entries ) {
          print "-------------------------\n";
          print "Duplicate " if ( @{ $entries{$_} } > 1 );
          print "Description or Gene Name: $_\n";
          foreach ( @{ $entries{$_} } ) {
              local $" = ', ';
              print <<EOF;
      ID: $_->{ID}
      Accession Numbers: @{ $_->{AC} }
      Descriptions: @{ $_->{DE} }
      Gene Names: @{ $_->{GN} }
      EOF
          }
          print "\n";
      }

      exit(0);

      #
      # read_entry() returns a hash reference representing the next entry in the
      # file, or undef at end of file.
      #
      # Each entry has four keys:
      #
      #   ID  The IDentifier of the entry, as a string
      #
      #   AC  The ACcession numbers of the entry, as an anonymous array reference
      #       with each element of the array being one accession number
      #
      #   DE  The DEscriptions of the entry, as an anonymous array reference
      #       with each element of the array being one description
      #
      #   GN  The Gene Names of the entry, as an anonymous array reference
      #       with each element of the array being one gene name
      #
      sub read_entry {
          my $entry = {
              ID => 'This entry had no ID',
              AC => [],
              DE => [],
              GN => [],
          };

          # Skip ahead to the next ID line, which starts an entry
          my $line = <>;
          $line = <> while ( defined($line) and $line !~ /^ID\s/ );
          return (undef) unless ( defined($line) );

          while ( defined($line) ) {
              if ( $line =~ m/^\/\// ) {
                  # "//" terminates the entry
                  last;
              }
              elsif ( $line =~ m/^ID/ ) {
                  if ( $line =~ m/^ID\s+(\S+)/ ) {
                      $entry->{ID} = $1;
                  }
                  else {
                      warn("malformed ID line: $line");
                  }
              }
              elsif ( $line =~ m/^AC\s+(.*)/ ) {
                  # Join consecutive AC lines, then split on ';'
                  my $accession_numbers;
                  do {
                      $accession_numbers .= $1;
                  } while ( defined( $line = <> ) and $line =~ m/^AC\s+(.*)/ );
                  $entry->{AC} = [ $accession_numbers =~ m/([^;]+);/g ];
                  next;    # $line already holds the next unprocessed line
              }
              elsif ( $line =~ m/^DE\s+(.*)/ ) {
                  my $description;
                  do {
                      $description .= $1;
                  } while ( defined( $line = <> ) and $line =~ m/^DE\s+(.*)/ );
                  $entry->{DE} = [ map { lc } $description =~ m/=([^;]+);/g ];
                  next;
              }
              elsif ( $line =~ m/^GN\s+(.*)/ ) {
                  my $gene_names;
                  do {
                      $gene_names .= $1;
                  } while ( defined( $line = <> ) and $line =~ m/^GN\s+(.*)/ );
                  $entry->{GN} = [ map { lc } $gene_names =~ m/=([^;]+);/g ];
                  next;
              }
              $line = <>;
          }

          return ($entry);
      }