in reply to fasta hash

It's usually more efficient to do it the other way round: First read the file with the IDs, store the IDs in a hash, and then go through the fasta file, and print each line if its ID appears in the hash. That way you have to store less data in memory.
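A minimal sketch of that order of operations, assuming the ID file holds one ID per line and that each FASTA header looks like ">ID description" (both are guesses, since the real files aren't shown). The sub name filter_fasta and the sample data are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the FASTA lines whose record ID is in the ID list.
# Assumes one ID per line in $ids_fh and headers of the form ">ID description".
sub filter_fasta {
    my ($ids_fh, $fasta_fh) = @_;

    # Pass 1: load the (small) ID list into a hash.
    my %wanted;
    while (my $id = <$ids_fh>) {
        chomp $id;
        next if $id =~ /^\s*$/;    # skip blank lines
        $wanted{$id} = 1;
    }

    # Pass 2: stream the (large) FASTA file, keeping only wanted records.
    my @out;
    my $keep = 0;
    while (my $line = <$fasta_fh>) {
        if ($line =~ /^>(\S+)/) {
            # A header line starts a new record; decide once per record.
            $keep = exists $wanted{$1};
        }
        push @out, $line if $keep;
    }
    return @out;
}

# Demo with in-memory filehandles standing in for the real files.
my $ids   = "seq2\n";
my $fasta = ">seq1 first\nACGT\n>seq2 second\nGGCC\nTTAA\n";
open my $ids_fh,   '<', \$ids   or die "Cannot open ID data: $!";
open my $fasta_fh, '<', \$fasta or die "Cannot open FASTA data: $!";
print filter_fasta($ids_fh, $fasta_fh);
```

Only the ID hash stays in memory; the FASTA file is read one line at a time. With real files you would open the two file names from @ARGV instead of the in-memory strings.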

Regarding your code: Use strict and warnings, and indent the code properly, for example 4 characters for each opening bracket. It actually makes code readable. See perlstyle.

@data=split(" ",$line);
$fastahash{$fastaID}=$sequence;

This is almost certainly wrong: the hash key ($fastaID) doesn't depend on $line, so whatever it is, it's not the current ID. First assign to $fastaID, then use it as a hash key.
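To illustrate "assign first, then use as key", here is a rough sketch. It assumes the ID is the first whitespace-separated field of the header with the leading ">" stripped; your headers may differ, so adjust the index. The sub name read_fasta is invented for the example:

```perl
use strict;
use warnings;

# Hypothetical helper: builds ID => sequence from a list of FASTA lines.
# Assumes the ID is the first field of the header, minus the leading ">".
sub read_fasta {
    my (@lines) = @_;
    my (%fastahash, $fastaID);
    for my $line (@lines) {
        chomp $line;
        next if $line =~ /^\s*$/;              # skip blank lines
        if ($line =~ /^>/) {
            my @data = split " ", $line;
            ($fastaID = $data[0]) =~ s/^>//;   # assign the ID first ...
            $fastahash{$fastaID} = "";         # ... then use it as the key
        }
        elsif (defined $fastaID) {
            # Sequence line: append to the record we are currently in.
            $fastahash{$fastaID} .= $line;
        }
    }
    return %fastahash;
}

my %h = read_fasta(">seq1 first\n", "ACGT\n", "TTAA\n", ">seq2 second\n", "GGCC\n");
print "$_\t$h{$_}\n" for sort keys %h;
```

The key is always set from the header line that was just read, so every sequence line is filed under the record it belongs to.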

Re^2: fasta hash
by morio56 (Initiate) on Aug 26, 2011 at 13:51 UTC

    Thanks. But then what will be the value for the ID key, since the IDs file only contains IDs and nothing else?

      As moritz said, 1 (or ++) is a common choice of value, but it's not necessarily the best one. It doesn't usually matter these days for 100k elements, but it's better (thanks, Liz! ;) for memory size to do something like this:

      my ($undef);
      while (whatever...) {
          ....
          $hash{$key} = $undef;
      }

      This way no element has to store its own copy of the value 1; each one just holds an undefined value, which is the cheapest scalar Perl can store. Plain assignment still copies the scalar, so this is only a kind of "poor man's aliasing". For bonus points, you might look at Array::RefElem or Data::Alias, which provide real aliasing.
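A small sketch of the undef-valued hash used as a lookup set. It uses a hash slice rather than a loop, which is a common idiom but not necessarily what was meant above; make_set and the sample IDs are made up:

```perl
use strict;
use warnings;

# Hypothetical helper: builds a lookup set whose values are all undef.
sub make_set {
    my (@keys) = @_;
    my %set;
    @set{@keys} = ();    # hash slice assignment: every key gets undef
    return \%set;
}

my $seen = make_set(qw(id1 id2 id3));

# The values are undef, so test membership with exists, not truthiness.
for my $id (qw(id2 id9)) {
    print "$id: ", (exists $seen->{$id} ? "found" : "missing"), "\n";
}
# prints:
# id2: found
# id9: missing
```

The catch is that `if ($seen->{$id})` is always false with undef values, so every lookup has to go through exists.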

        I have changed the code, but now my problem seems to be that I can only access the last line of the output outside the loops. I wonder if there's a way to store the variables inside the loop to be accessible outside. The code looks like this now.

        if(@ARGV < 3){
            die "Not enough arguments\n";
        }
        $sequence="";
        $fastaID;
        open(FILE1,"$ARGV[0]") or die "No fasta file provided in command line: $!\n";
        while ($line=<FILE1>){
            chomp($line);
            if ($line=~/^\s*$/){
                next;
            }elsif ($line=~/^.*$/){
                $fastaID=$line;
                $fastahash{$fastaID}=1;
            }
        }
        open(FILE2,"$ARGV[2]") or die "No fasta file provided in command line: $!\n";
        while($line2=<FILE2>){
            chomp($line2);
            if ($line2=~/^>/){
                @data=split(" ",$line2);
                $fasta=$data[1];
                $sequence="";
            }else{
                $sequence.=$line2;
            }
        }
        if (exists $fastahash{$fasta}){
            print "$fastaID\t $sequence\n";
        }
        exit;

        And the output, which is just the last key value in the fastahash, is:

        2056360013 Musacgagchagshgashcgahcgacacsasasasacsacsasasacacaascassacsaascascascascac