ashnator has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys, I need help desperately. I have to parse an 8GB file containing sequences in FASTA format. How can I do it with Perl? I have to extract only the unique FASTA headers along with their sequences. For example, for ">ELKSMKO02JGD0L" I also need to get its sequence string, TCAGGAATCTAATACTCAAGCT..... I will be obliged if the Monks can help me out. The problem gets complex because it's a huge 8GB file. The file looks like this:

>ELKSMKO02JGD0L
TCAGGAATCTAATACTCAAGCTGTGGCCTATCCAGTACAACATGTAGCGAGACAATAATATCTCAGGATC
TGAATACACCCCTTCTGTTAAAATGCAGTCTAGGATTACACTAGCTTTGTTCACAGCCACGTAACACCAC
TGACTCACATGAAGACTGAAGACAACACAACCCCCCACATCTTGTTCACAAAAACTGGTAGCATGCCAGG
TCTTCCATATCTTTACAGGACACTTGGTATTTTACAAAACTTAATTC
>ELKSMKO02FEYZW
TCAGTCATAATGTCATTTCTTCAAAACTTGATCTGTAGATTTAATGGAACCCCAATCAAAATTCCAGCAA
ATTATTAAGTGGATATCCACAAACTGGTTCAAAAGTTTATATGAAATAACAAAAGATCCAGAACAGCCAA
CATAATATTGAAGGAGAAGAATGAAGTTCGAGGTCTAACAAACTAATTTCTGTATGATTCCAACTACATG
GCATTCTGGAAAATGAAAAATTACAGACACAGTAAATAGCTCAGTGATTGCCAGGTAGG
>ELKSMKO02IX3A4
TCAGTCCCAACGTGCTGGGAGGGCGTGAGCCACGGTGCCCAGCCTTTTTATTTTTTATTTTTATTTTTAA
TCTGTCTTGATTTTGCTTCCTTCCTAAACAGTTTTGGCTTCGTGATCACGTAAACCAAGAGTCACAAACT
GAAATGCCATCAAGGGGCCAAGCAGGTAACAAAATTCAAGTCATACAGGTTCAATGTCTTAGTCACCCCA
GGCTACAACAGAATATCATAGACTGGGTANCCTAATAATACAGATCATTTTCNCATGGTTCTAGAGGAC

Replies are listed 'Best First'.
Re: Need to Parse 8 GB File
by dHarry (Abbot) on Sep 09, 2008 at 06:44 UTC

    Take a look at FASTAParse or Bio::FastaStream; there are several parsing modules on CPAN that handle FASTA-formatted sequences. You can also do a Super Search in the Monastery. It's not the first time that FASTA has turned up here. 8GB does sound a bit big, however.

    HTH

Re: Need to Parse 8 GB File
by AZed (Monk) on Sep 10, 2008 at 01:27 UTC

    Mhm, you forgot to enclose your file in <code></code> tags, but never mind, I can see what it is from the page source.

    What you have there is a file that is in the format of ">", the ID, a newline, and then several lines of sequence data. There might be modules for this sort of thing, as dHarry suggested, but from a quick look at them they either want to parse things after they've been broken up, or want to keep huge amounts of data in memory as they parse, so you may be better off just going one node at a time with a regexp.

    Since '>' only happens at the beginning of a new ID/segment, we can break the file up into manageable parts by setting the input record separator ($/) to that character before we start parsing, and then set it back after we're done in case we need to do something else. Each read then returns one whole record instead of one line. Once we've broken the file up into chunks, we have a much simpler problem: separating the first line of each chunk into one variable, and the rest into another.

    Here's some sample code to get you started, loading and dealing with only one chunk at a time:
        #!/usr/bin/perl
        #
        # Parse simple FASTA text, chunk by chunk

        use strict;
        use warnings;

        my $term = $/;
        my $fastafile = 'fasta.txt';
        my $pos = 0;
        my $id;
        my @sequencelines;
        my $sequence;
        my $line;

        open(FASTA,"<",$fastafile) or die("Open failed: $!");

        $/ = ">";
        while(<FASTA>) {
            chomp;
            # Since the file begins with ">", the first extraction will
            # contain only that '>', which will then get chomped, so we'll
            # have a blank line to skip.
            next if($_ eq '');
            ($id,@sequencelines) = split /\n/;
            # I'm not sure if the ID is supposed to include the '>' in front
            # of it or not, but if so, we can put it back.
            $id = '>' . $id;
            print "Found ID '",$id,"' at position ",$pos,":\n";
            $sequence = '';
            foreach $line (@sequencelines) {
                print $line,"\n";
                $sequence .= $line;
            }
            print "\n";
            $pos++;
        }
        $/ = $term;
    It outputs:
        Found ID '>ELKSMKO02JGD0L' at position 0:
        TCAGGAATCTAATACTCAAGCTGTGGCCTATCCAGTACAACATGTAGCGAGACAATAATATCTCAGGATC
        TGAATACACCCCTTCTGTTAAAATGCAGTCTAGGATTACACTAGCTTTGTTCACAGCCACGTAACACCAC
        TGACTCACATGAAGACTGAAGACAACACAACCCCCCACATCTTGTTCACAAAAACTGGTAGCATGCCAGG
        TCTTCCATATCTTTACAGGACACTTGGTATTTTACAAAACTTAATTC

        Found ID '>ELKSMKO02FEYZW' at position 1:
        TCAGTCATAATGTCATTTCTTCAAAACTTGATCTGTAGATTTAATGGAACCCCAATCAAAATTCCAGCAA
        ATTATTAAGTGGATATCCACAAACTGGTTCAAAAGTTTATATGAAATAACAAAAGATCCAGAACAGCCAA
        CATAATATTGAAGGAGAAGAATGAAGTTCGAGGTCTAACAAACTAATTTCTGTATGATTCCAACTACATG
        GCATTCTGGAAAATGAAAAATTACAGACACAGTAAATAGCTCAGTGATTGCCAGGTAGG

        Found ID '>ELKSMKO02IX3A4' at position 2:
        TCAGTCCCAACGTGCTGGGAGGGCGTGAGCCACGGTGCCCAGCCTTTTTATTTTTTATTTTTATTTTTAA
        TCTGTCTTGATTTTGCTTCCTTCCTAAACAGTTTTGGCTTCGTGATCACGTAAACCAAGAGTCACAAACT
        GAAATGCCATCAAGGGGCCAAGCAGGTAACAAAATTCAAGTCATACAGGTTCAATGTCTTAGTCACCCCA
        GGCTACAACAGAATATCATAGACTGGGTANCCTAATAATACAGATCATTTTCNCATGGTTCTAGAGGAC

    Note that I kept the output readable by printing each element of the array one at a time, but if you need to deal with the entire sequence (e.g. because you're searching for a sequence containing a particular substring), then $sequence holds the whole thing as one string until the next pass through the loop resets it. Because each pass overwrites the previous chunk, only one record is held at a time, so you don't have to worry about holding 8GB of data in memory at once.
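The original question also asked for only the unique headers, which the sample above does not filter. The following is a minimal sketch of one way to add that, under some assumptions that are not from the thread: "unique" is taken to mean unique by ID (the header text up to the first whitespace), the first occurrence of each ID is kept, and the sub name `print_unique_fasta` is hypothetical. It streams one record at a time using the same `$/` approach, but keeps every ID it has seen in a hash, so memory use grows with the number of distinct IDs rather than with the file size.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stream FASTA records from $in and write each ID only once to $out.
# Assumption: a record is a duplicate if its ID (first word of the
# header line) has been seen before; later duplicates are dropped.
# Returns the number of records written.
sub print_unique_fasta {
    my ($in, $out) = @_;
    my %seen;          # IDs already written; grows with the number of unique IDs
    my $written = 0;
    local $/ = ">";    # read up to the next '>'; restored automatically on return
    while (my $record = <$in>) {
        chomp $record;                   # strip the trailing '>'
        next if $record !~ /\S/;         # skip the empty chunk before the first '>'
        my ($header, @lines) = split /\r?\n/, $record;
        my ($id) = split ' ', $header;   # ID = header text up to first whitespace
        next if !defined($id) || $seen{$id}++;
        # Header on one line, sequence joined onto a single line
        print {$out} ">$header\n", join('', @lines), "\n";
        $written++;
    }
    return $written;
}

# Tiny in-memory demonstration. For the real file, open it instead:
#   open(my $in, '<', 'fasta.txt') or die "Open failed: $!";
my $sample = ">A desc\nAC\nGT\n>B\nTT\n>A desc\nAC\nGT\n";
open(my $in, '<', \$sample) or die "Open failed: $!";
print_unique_fasta($in, \*STDOUT);
close($in);
```

On the sample string this writes the records for `A` and `B` once each and drops the repeated `A`. Whether the `%seen` hash fits in memory for an 8GB file depends on how many distinct IDs it contains; if it does not fit, a disk-backed hash or an external sort on the IDs would be alternatives to consider.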

    That what you need?

      Thanks buddy..