in reply to Saving different values for the same key by using Hash of Arrays

You've chosen an effective use for a hash, but if you only need to find the longest (or longer or only) sequence for an ID that has one or more sequences, consider the following solution that doesn't use an array:

use strict;
use warnings;

my %FASTAhash;

open my $file, '<', 'FASTA.txt' or die $!;

while (<$file>) {
    next if !/(>[^ ]+) /;

    chomp( $FASTAhash{$1} = $' )
        if !$FASTAhash{$1}
        or length $' > length $FASTAhash{$1};
}

close $file;

print "$_ $FASTAhash{$_}\n" for keys %FASTAhash;

The regex matches the ID, which is placed into $1, leaving the rest of the line (the unmatched sequence) in $'. If nothing is stored yet for the ID in $1, or $' is longer than the sequence already stored there, the sequence in $' is assigned to that hash item and then chomped. When done, each ID is paired with its longest sequence. (Is it possible for two sequences of the same ID to be the same length? If so, do you need to code for that?)

Output from processing your data:

>ENSG00000147724 MSEIQGTVEFSVELHKFYNVDLFQRGYYQIRVTLKVSSRIPHRLSASIAGQTESSSLHSA CVHDSTVHSRVFQILYRNEEVPINDAVVFRVHLLLGGERMEDALSEVDFQLKVDLHFTDS EQQLRDVAGAPMVSSRTLGLHFHPRNGLHHQVP
>ENSG00000010072 MDDDLMLALRLQEEWNLQEAERDHAQESLSLVDASWELVDPTPDLQALFVQFNDQFFWGQ LEAVEVKWSVRMTLCAGICSYEGKGGMCSIRLSEPLLKLRPRKDLVETLLHEMIHAYLFV TNNDKDREGHGPEFCKHMHRINSLTGANITVYHTFHDEVDEYRRHWWRCNGPCQHRPPYY GYVKRATNREPSAHDYWWAEHQKTCGGTYIKIKEPENYSKKGKGKAKLGKEPVLAAENKD KPNRGEAQLVIPFSGKGYVLGETSNLPSPGKLITSHAINKTQDLLNQNHSANAVRPNSKI KVKFEQNGSSKNSHLVSPAVSNSHQNVLSNYFPRVSFANQKAFRGVNGSPRISVTVGNIP KNSVSSSSQRRVSSSKISLRNSSKVTESASVMPSQDVSGSEDTFPNKRPRLEDKTVFDNF FIKKEQIKSSGNDPKYSTTTAQNSSSSSSQSKMVNCPVCQNEVLESQINEHLDWCLEGDS IKVKSEESL*
>ENSG00000067082 Sequence unavailable
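For anyone who would rather avoid $', here is a minimal sketch of the same keep-the-longest idea using two capture groups. The sub name longest_by_id and the sample records are made up for illustration and are not the poster's data. As far as I can tell, the posted code compares length $' while $' still carries its trailing newline, so a later sequence of equal length would replace the earlier one; this sketch compares chomped lengths instead, so the first one seen wins. Which tie behavior is wanted is an assumption you would need to settle.

```perl
use strict;
use warnings;

# Hypothetical helper: takes lines of the form ">ID SEQUENCE" and
# returns a hash ref mapping each ID to its longest sequence.
sub longest_by_id {
    my @lines = @_;
    my %longest;

    for my $line (@lines) {
        chomp( my $copy = $line );

        # $1 is the ID (including '>'), $2 is everything after the first space.
        next unless $copy =~ /^(>\S+) (.*)$/;
        my ( $id, $seq ) = ( $1, $2 );

        # Keep the sequence if the ID is new or this sequence is strictly longer.
        $longest{$id} = $seq
            if !exists $longest{$id}
            or length $seq > length $longest{$id};
    }

    return \%longest;
}

# Made-up sample records.
my $result = longest_by_id(
    ">id1 AAAA\n",
    ">id1 AAAAAAA\n",
    ">id2 CC\n",
    ">id1 AA\n",
);

print "$_ $result->{$_}\n" for sort keys %$result;
```

Using exists rather than a truth test also means a stored sequence that happens to be "0" or empty is not mistaken for a missing entry.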

Hope this helps!

Re^2: Saving different values for the same key by using Hash of Arrays
by beginner27 (Initiate) on May 07, 2012 at 09:53 UTC

    Thanks a lot for your quick and detailed answer, but the script doesn't actually print anything for me! How is this possible?

      It's possible because my regex didn't work with your reformatted FASTA records. :) aaron_baugher's suggestion to repost your records using <code> or <pre> was spot on, and helped with crafting the following new-and-improved solution--after your re-posting:

      use strict;
      use warnings;

      my %FASTAhash;

      {
          local $/ = '>';
          open my $file, '<', 'FASTA.txt' or die $!;

          while (<$file>) {
              next if !/(.*?)\n/;

              chomp( $FASTAhash{$1} = $' )
                  if !$FASTAhash{$1}
                  or length $' > length $FASTAhash{$1};
          }
      }

      print ">$_\n$FASTAhash{$_}" for keys %FASTAhash;

      Within a block, we start by telling perl that '>' is the new input record separator, instead of the default "\n", so the file is read one FASTA record at a time rather than one line at a time. The regex is also tweaked a bit to grab the ID, which is now everything up to the first newline of each record.
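      To see the record-separator trick in isolation, here is a small sketch that reads FASTA text from an in-memory filehandle with $/ set to '>'. The sub name fasta_records and the sample text are hypothetical, and the regex differs slightly from the one above because it captures the sequence in a second group instead of relying on $'.

```perl
use strict;
use warnings;

# Hypothetical helper: splits FASTA text into [id, sequence] pairs by
# reading it with '>' as the input record separator.
sub fasta_records {
    my ($text) = @_;
    my @records;

    # Open a read-only filehandle on the string (an in-memory file).
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

    local $/ = '>';    # restored automatically when the sub returns

    while ( my $record = <$fh> ) {
        chomp $record;    # strips a trailing '>' (the current $/), if any

        # The first line is the ID; the rest is the sequence.
        # The leading record is empty after chomp and is skipped here.
        next unless $record =~ /\A(.*?)\n(.*)\z/s;
        push @records, [ $1, $2 ];
    }

    return @records;
}

my $text = ">id1\nAAAA\nCC\n>id2\nGG\n";    # made-up sample

for my $rec ( fasta_records($text) ) {
    my ( $id, $seq ) = @$rec;
    print "ID: $id\n$seq";
}
```

      Because $/ is '>' while reading, chomp removes the '>' that ends each record rather than a newline, which is why the sequences keep their own line breaks.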

      You'll note that we don't use close $file; when we're done, since the file's automatically closed when my $file falls out of scope (when the block ends).

      Here's the output:

      >ENSG00000147724
      MSEIQGTVEFSVELHKFYNVDLFQRGYYQIRVTLKVSSRIPHRLSASIAGQTESSSLHSA
      CVHDSTVHSRVFQILYRNEEVPINDAVVFRVHLLLGGERMEDALSEVDFQLKVDLHFTDS
      EQQLRDVAGAPMVSSRTLGLHFHPRNGLHHQVP
      >ENSG00000067082
      Sequence unavailable
      >ENSG00000010072
      MDDDLMLALRLQEEWNLQEAERDHAQESLSLVDASWELVDPTPDLQALFVQFNDQFFWGQ
      LEAVEVKWSVRMTLCAGICSYEGKGGMCSIRLSEPLLKLRPRKDLVETLLHEMIHAYLFV
      TNNDKDREGHGPEFCKHMHRINSLTGANITVYHTFHDEVDEYRRHWWRCNGPCQHRPPYY
      GYVKRATNREPSAHDYWWAEHQKTCGGTYIKIKEPENYSKKGKGKAKLGKEPVLAAENKD
      KPNRGEAQLVIPFSGKGYVLGETSNLPSPGKLITSHAINKTQDLLNQNHSANAVRPNSKI
      KVKFEQNGSSKNSHLVSPAVSNSHQNVLSNYFPRVSFANQKAFRGVNGSPRISVTVGNIP
      KNSVSSSSQRRVSSSKISLRNSSKVTESASVMPSQDVSGSEDTFPNKRPRLEDKTVFDNF
      FIKKEQIKSSGNDPKYSTTTAQNSSSSSSQSKMVNCPVCQNEVLESQINEHLDWCLEGDS
      IKVKSEESL*
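      On the point about not calling close: the sketch below illustrates a lexical filehandle being closed when its variable goes out of scope. It writes to a temporary file without an explicit close and then reads the text back. The sub name write_without_close is made up for illustration, and File::Temp is used only to get a scratch file name.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: writes text to a file without calling close.
sub write_without_close {
    my ( $path, $text ) = @_;
    open my $out, '>', $path or die "Cannot write $path: $!";
    print {$out} $text;

    # No close here: $out holds the only reference to the handle, so Perl
    # closes (and flushes) it when $out goes out of scope at sub exit.
}

my ( $tmp_fh, $path ) = tempfile( UNLINK => 1 );
close $tmp_fh;    # only the name of the temporary file is needed

write_without_close( $path, "ACGT\n" );

# The text is already on disk, so the handle must have been flushed and closed.
open my $in, '<', $path or die "Cannot read $path: $!";
my $line = <$in>;
close $in or die "close failed: $!";

print "Read back: $line";
```

      One design note: an implicit close cannot report an error, so for files you write to, an explicit close with an error check may still be the safer habit. For a read-only loop like the one above, relying on scope is usually fine.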

      Hope this version's helpful!

      Update: After posting the above, just noticed aaron_baugher's solution using $/ = '>' and I think this makes good sense, since this is the FASTA record delimiter.

        Both versions of the code work perfectly! Thank you a lot guys, your help has been invaluable!!!

        I hope that time will make me more confident with Perl so that one day I too can be useful to someone in need.