Creating hash from data extracted from text file in fasta format

reebee3 has asked for the wisdom of the Perl Monks concerning the following question:

I have a text file in fasta format, which looks like...

>SequenceID|1234_Gene1

ATGTGTGTGTGTGTGT (DNA Sequence)

>SequenceID|9876_Gene2

CTGTCGCTCGGAAAG (DNA Sequence)

I am trying to write a script that will pull this data from a txt file in the command line into a hash and then sort it by length. The hash would be set up as the SequenceID has the key and DNA sequence as the value.

This is what I have so far and it works, except I set up the hash inside the script and not from a txt file added in the command line, I am not sure how to do this. I do not want to use BioPerl.

my %sequences = (
    "SequenceID|1234_Gene1"  => 'CTTTTAAGCTGATTAGGCTTTTATACCATTAGATTTA
+GTAACTATTGTCTTTTAA',
    "SequenceID|9876_Gene2"  => 'GTGCTGTCTTAAGTTGAACAGAGTGTGGGAGGAAATA
+TAAGCAAAGTTATTCCGTAGAATT',
    "SequenceID|4567_Gene3" => 'CATCCTCCTTTACACCCCACAAACATTTGGCAACCCCT
+GATAGGTTTCTTTCTTGTGGA', 
    
);

foreach my $key (keys %sequences)
{
  my $len = length ($sequences{$key});
  print "$key:$len\n";
}
[download]

Thank you

Comment on Creating hash from data extracted from text file in fasta format Download Code

Replies are listed 'Best First'.

Re: Creating hash from data extracted from text file in fasta format
by Athanasius (Archbishop) on Oct 14, 2015 at 06:09 UTC

Hello reebee3,

Here’s one way to approach this task:

#! perl
use strict;
use warnings;

my (%seqs, $id, $dna);

while (my $line = <>)
{
    chomp $line;

    if ($line =~ / ^ > (.+) /x)
    {
        $seqs{$id} = $dna if defined $id;
        $id        = $1;
        $dna       = '';
    }
    else
    {
        $dna      .= $line;
    }
}

$seqs{$id} = $dna if defined $id;

for my $key (sort { length $seqs{$a} <=>
                    length $seqs{$b} } keys %seqs)
{
    printf "%s:%d\n", $key, length $seqs{$key};
}
[download]

Output:

15:55 >perl 1406_SoPW.pl data.fas
SequenceID|9876_Gene2:15
SequenceID|1234_Gene1:16

15:55 >
[download]

Notes:

The above code contains no error checking! In particular, it doesn’t check that the fasta file format is valid. You say “I do not want to use BioPerl”, but a dedicated module is usually better and safer than hand-written code.
The special filehandle <> reads from the file(s) specified on the command line (or from standard input if no files are specified). For other approaches, see perlopentut#Opening-Text-Files-for-Reading.
You say you want to sort the data by length, but you don’t specify the sort order. I have assumed increasing order. If you want decreasing order instead, reverse the occurrences of $a and $b: sort { length $seqs{$b} <=> length $seqs{$a} }

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

[reply]
[d/l]
[select]

Re: Creating hash from data extracted from text file in fasta format
by kcott (Archbishop) on Oct 14, 2015 at 06:10 UTC

G'day reebee3,

To read lines from a file on the command line, you can just do this:

while (<>) {
    # $_ holds the line read - process it here
}
[download]

In your foreach loop, you can get each key in the sorted order you want by changing

... (keys ...
[download]

... (sort {length $sequences{$a} <=> length $sequences{$b} } keys ...
[download]

With this input file:

$ cat pm_1144803_fasta.txt
>SequenceID|1234_Gene1
CTTTTAAGCTGATTAGGCTTTTATACCATTAGATTTAGTAACTATTGTCTTTTAA
>SequenceID|9876_Gene2
GTGCTGTCTTAAGTTGAACAGAGTGTGGGAGGAAATATAAGCAAAGTTATTCCGTAGAATT
>SequenceID|4567_Gene3
CATCCTCCTTTACACCCCACAAACATTTGGCAACCCCTGATAGGTTTCTTTCTTGTGGA
[download]

and this script:

$ cat pm_1144803_fasta_seq_len_sort.pl
#!/usr/bin/env perl

use strict;
use warnings;

my (%sequences, $seq_key);

while (<>) {
    chomp;

    if (/^>/) {
        $seq_key = substr $_, 1;
    }
    else {
        $sequences{$seq_key} = $_;
    }
}

foreach my $key (
    sort {length $sequences{$a} <=> length $sequences{$b} } keys %sequ
+ences)
{
    my $len = length ($sequences{$key});
    print "$key:$len\n";
}
[download]

You can run this from the command line:

$ pm_1144803_fasta_seq_len_sort.pl pm_1144803_fasta.txt
SequenceID|1234_Gene1:55
SequenceID|4567_Gene3:59
SequenceID|9876_Gene2:61
[download]

— Ken

[reply]
[d/l]
[select]

Re: Creating hash from data extracted from text file in fasta format (QUIBBLE)
by ww (Archbishop) on Oct 14, 2015 at 12:25 UTC

Very decent answers above. So this is merely a quibble.

You say

"I do not want to use BioPerl."

Why? ... and why should we not treat such an observation as being of the same order as
"I do not want to use a module." or even,
"I want to dig a hole but I do not want to use a shovel." ?

Using a shovel is usually easier than digging with your bare hands. Using a module is usually easier than reinventing something with its capabilities ... and using BioPerl (seems to be) a well-tested and effective method of achieving your goal.

Spirit of the Monastery

[reply]
[d/l]