krish28 has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I need the wisdom of the monks for a problem.
I am trying to read a multi-sequence fasta file and then put each of the sequences in an array, after which I work on each array to get some measures.
I want to do all this within one program, so I tried using an array of arrays to capture all the sequences from one ">sequence header" to the next ">sequence header".

while (<INFILE>) {
    chomp;
    push @dna,[split /^>(.*)/];
}
close INFILE;

The input is a file with a lot of fasta sequences like
>sequence header 1.
AAATATTATATATATTGCG
ATTATTATATGCGCGGCGC
>sequence header 2
AATTGGGCTCGCTGCTTTT
AGGAGGAGGAGCCCTCTCC
>sequence header 3
AATTGGCTGCTCGCTGCTC
AATGTGTCGGCGCGCGTGC

I want each of the sequences in its own array, or at least as an array within an array of arrays.

I would appreciate any helpful suggestions.

Thanks

Kris

Replies are listed 'Best First'.
Re: Splitting a multi-sequence fasta file into individual sequences in individual arrays
by BrowserUk (Patriarch) on Feb 09, 2011 at 04:36 UTC

    Why do you want an AoA? Each record is a header plus a single, wrapped sequence. Wouldn't a hash be more useful?

    #! perl -slw
    use strict;
    use Data::Dump qw[ pp ];

    my %seqs;
    {
        local $/ = ">";
        my @seqs = <DATA>;
        chomp @seqs;
        s[\n][\t] for @seqs;
        tr[\n][]d for @seqs;
        shift @seqs;
        %seqs = map split( "\t" ), @seqs;
    }
    pp \%seqs;

    __DATA__
    >sequence header 1.
    AAATATTATATATATTGCG
    ATTATTATATGCGCGGCGC
    >sequence header 2
    AATTGGGCTCGCTGCTTTT
    AGGAGGAGGAGCCCTCTCC
    >sequence header 3
    AATTGGCTGCTCGCTGCTC
    AATGTGTCGGCGCGCGTGC

    Prints

    [ 4:34:55.96] c:\test>junk40
    {
      "sequence header 1." => "AAATATTATATATATTGCGATTATTATATGCGCGGCGC",
      "sequence header 2" => "AATTGGGCTCGCTGCTTTTAGGAGGAGGAGCCCTCTCC",
      "sequence header 3" => "AATTGGCTGCTCGCTGCTCAATGTGTCGGCGCGCGTGC",
    }
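
If per-sequence arrays really are needed for the later measures, one possible follow-on is to split each hash value into its individual bases, giving a hash of arrays keyed by header. This is only a sketch: the `%bases` name and the GC count are illustrative assumptions, not something asked for in the thread, and the sample data is copied from the output above.

```perl
use strict;
use warnings;

# Sample data copied from the printed hash above.
my %seqs = (
    'sequence header 1.' => 'AAATATTATATATATTGCGATTATTATATGCGCGGCGC',
    'sequence header 2'  => 'AATTGGGCTCGCTGCTTTTAGGAGGAGGAGCCCTCTCC',
    'sequence header 3'  => 'AATTGGCTGCTCGCTGCTCAATGTGTCGGCGCGCGTGC',
);

# Hash of arrays: header => [ 'A', 'A', 'A', 'T', ... ]
# (%bases is a hypothetical name chosen for this sketch.)
my %bases = map { $_ => [ split //, $seqs{$_} ] } keys %seqs;

for my $header ( sort keys %bases ) {
    my @residues = @{ $bases{$header} };

    # Counting G and C is just an example measure (an assumption).
    my $gc = grep { /[GC]/ } @residues;

    printf "%s: length %d, GC count %d\n", $header, scalar @residues, $gc;
}
```

Keeping the header as the hash key means each array can still be looked up by name, which a plain array of arrays would not give you.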

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thanks for your reply, BrowserUk... I got the code working by using just the hash
      Have a good one
      Krishna.

        If you consult the FASTA format spec, all the lines in between headers will be the same length (except the last), and that length has no significance beyond being an arbitrary wrap point. All the lines between two headers constitute a single sequence, not a set of sequences.
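
To illustrate the point about wrapped lines, here is a minimal line-by-line sketch that appends every line after a header to one string until the next header appears. The `parse_fasta` name is hypothetical, and the input is read from an in-memory string only to keep the example self-contained; a real file handle would be used the same way.

```perl
use strict;
use warnings;

# parse_fasta is a hypothetical helper name used only for this sketch.
# It returns a hash reference mapping each header to its joined sequence.
sub parse_fasta {
    my ($fh) = @_;
    my ( %seqs, $header );

    while ( my $line = <$fh> ) {
        chomp $line;
        next unless length $line;    # skip blank lines

        if ( $line =~ /^>(.*)/ ) {
            # A new header starts a new, initially empty, sequence.
            $header = $1;
            $seqs{$header} = '';
        }
        elsif ( defined $header ) {
            # Wrapped lines are concatenated into one sequence.
            $seqs{$header} .= $line;
        }
    }
    return \%seqs;
}

# Sample records taken from the original question.
my $fasta = <<'END_FASTA';
>sequence header 1.
AAATATTATATATATTGCG
ATTATTATATGCGCGGCGC
>sequence header 2
AATTGGGCTCGCTGCTTTT
AGGAGGAGGAGCCCTCTCC
END_FASTA

open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
my $seqs = parse_fasta($fh);
close $fh;

print "$_ => $seqs->{$_}\n" for sort keys %$seqs;
```

Unlike reading with `$/` set to ">", this version never holds more than one line of input at a time on top of the results, which may matter for large files.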


Re: Splitting a multi-sequence fasta file into individual sequences in individual arrays
by umasuresh (Hermit) on Feb 09, 2011 at 18:35 UTC