Mhm, you forgot to enclose your file in <code></code> tags, but never mind, I can see what it is from the page source.
What you have there is a FASTA-style file: each record is a ">", the ID, a newline, and then several lines of sequence data. There might be modules for this sort of thing, as dHarry suggested, but from a quick look at them they either want to parse things after they've been broken up, or want to keep huge amounts of data in memory as they parse, so you may be better off just going one record at a time and splitting it yourself.
Since '>' only appears at the beginning of a new ID/segment, we can break the file up into manageable parts by setting Perl's input record separator, $/, to '>' before we start reading, and then setting it back after we're done in case we need to do something else. Each read then returns one whole record instead of one line. Once we've broken the file up into chunks, we have a much simpler problem: separating the first line of each chunk (the ID) into one variable, and the remaining lines (the sequence) into another.
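To see the separator trick in isolation, here is a minimal sketch that reads from an in-memory string rather than a file. The sample data and the variable names are made up for illustration, and it uses local instead of saving and restoring $/ by hand, so the old value comes back automatically when the block ends.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up two-record sample, standing in for a real FASTA file.
my $data = ">id1\nACGT\nTTGA\n>id2\nGGCC\n";

my @chunks;
{
    # local restores the previous value of $/ when this block exits.
    local $/ = ">";
    open(my $fh, "<", \$data) or die("Open failed: $!");
    while (my $chunk = <$fh>) {
        chomp $chunk;            # strips the trailing '>' separator, if any
        next if $chunk eq '';    # nothing precedes the first '>'
        push @chunks, $chunk;
    }
    close($fh);
}

# Each chunk is now an ID line followed by that record's sequence lines.
print scalar(@chunks), " chunks\n";
print "[$_]\n" for @chunks;
```

The last record has no '>' after it, so chomp leaves it untouched, which is exactly what we want.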
Here's some sample code to get you started, loading and dealing with only one chunk at a time:
#!/usr/bin/perl
#
# Parse simple FASTA text, chunk by chunk
use strict;
use warnings;

my $term = $/;    # remember the original record separator
my $fastafile = 'fasta.txt';
my $pos = 0;
my $id;
my @sequencelines;
my $sequence;
my $line;

open(my $fasta,"<",$fastafile) or die("Open failed: $!");
$/ = ">";         # read one '>'-delimited record at a time
while(<$fasta>)
{
    chomp;
    # Since the file begins with ">", the first extraction will
    # contain only that '>', which will then get chomped, so we'll
    # have an empty record to skip.
    next if($_ eq '');
    # First line of the record is the ID, the rest is sequence data.
    ($id,@sequencelines) = split /\n/;
    # I'm not sure if the ID is supposed to include the '>' in front
    # of it or not, but if so, we can put it back.
    $id = '>' . $id;
    print "Found ID '",$id,"' at position ",$pos,":\n";
    # Rebuild the full sequence as one string, without the newlines.
    $sequence = '';
    foreach $line (@sequencelines)
    {
        print $line,"\n";
        $sequence .= $line;
    }
    print "\n";
    $pos++;
}
close($fasta);
$/ = $term;       # put the record separator back
It outputs:
Found ID '>ELKSMKO02JGD0L' at position 0:
TCAGGAATCTAATACTCAAGCTGTGGCCTATCCAGTACAACATGTAGCGAGACAATAATATCTCAGGATC
TGAATACACCCCTTCTGTTAAAATGCAGTCTAGGATTACACTAGCTTTGTTCACAGCCACGTAACACCAC
TGACTCACATGAAGACTGAAGACAACACAACCCCCCACATCTTGTTCACAAAAACTGGTAGCATGCCAGG
TCTTCCATATCTTTACAGGACACTTGGTATTTTACAAAACTTAATTC
Found ID '>ELKSMKO02FEYZW' at position 1:
TCAGTCATAATGTCATTTCTTCAAAACTTGATCTGTAGATTTAATGGAACCCCAATCAAAATTCCAGCAA
ATTATTAAGTGGATATCCACAAACTGGTTCAAAAGTTTATATGAAATAACAAAAGATCCAGAACAGCCAA
CATAATATTGAAGGAGAAGAATGAAGTTCGAGGTCTAACAAACTAATTTCTGTATGATTCCAACTACATG
GCATTCTGGAAAATGAAAAATTACAGACACAGTAAATAGCTCAGTGATTGCCAGGTAGG
Found ID '>ELKSMKO02IX3A4' at position 2:
TCAGTCCCAACGTGCTGGGAGGGCGTGAGCCACGGTGCCCAGCCTTTTTATTTTTTATTTTTATTTTTAA
TCTGTCTTGATTTTGCTTCCTTCCTAAACAGTTTTGGCTTCGTGATCACGTAAACCAAGAGTCACAAACT
GAAATGCCATCAAGGGGCCAAGCAGGTAACAAAATTCAAGTCATACAGGTTCAATGTCTTAGTCACCCCA
GGCTACAACAGAATATCATAGACTGGGTANCCTAATAATACAGATCATTTTCNCATGGTTCTAGAGGAC
Note that I kept the output readable by printing the sequence lines one at a time, but if you need to deal with the entire sequence (e.g. because you're searching for a sequence containing a particular substring), then $sequence holds it as a single string until the next pass through the loop resets it. Only one record is held at a time, so you don't have to worry about holding 8GB of data in memory at once.
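If the substring search is what you're after, something along these lines might do as a starting point. It's only a sketch: ids_containing is a name I made up, the sample data is invented, and it reads from an in-memory string so it can be run as-is. The sequence lines are joined before searching so that a match spanning a line break is still found.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the IDs of all records whose joined
# sequence contains $motif. Reads one record at a time from $fh.
sub ids_containing {
    my ($fh, $motif) = @_;
    my @hits;
    local $/ = ">";    # restored automatically when the sub returns
    while (my $chunk = <$fh>) {
        chomp $chunk;
        next if $chunk eq '';
        my ($id, @lines) = split /\n/, $chunk;
        my $sequence = join('', @lines);
        # index() is a plain substring search, no regex involved.
        push @hits, $id if index($sequence, $motif) >= 0;
    }
    return @hits;
}

# Made-up sample data; "GTTT" straddles the line break in seqA.
my $data = ">seqA\nACGT\nTTGA\n>seqB\nGGCC\nAATT\n";
open(my $fh, "<", \$data) or die("Open failed: $!");
my @found = ids_containing($fh, "GTTT");
close($fh);

print "Match: $_\n" for @found;
```

For a real file, open it the usual way and pass that filehandle in instead of the in-memory one.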
That what you need?