Mhm, you forgot to enclose your file in <code></code> tags, but never mind, I can see what it is from the page source.
What you have there is a FASTA-style file: each record is a ">", the ID, a newline, and then several lines of sequence data. There might be modules for this sort of thing, as dHarry suggested, but from a quick look at them they either want to parse things after they've been broken up, or want to keep huge amounts of data in memory as they parse, so you may be better off just going one record at a time with a regexp.
Since '>' only appears at the beginning of a new ID/segment, we can break the file up into manageable parts by setting the input record separator ($/) to '>' before we start parsing, and then setting it back after we're done in case we need to do something else. Once the file is read in chunks like that, we have a much simpler problem: splitting each chunk so that its first line goes into one variable and the remaining lines go into another.
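As a rough sketch of that idea, here is one way to scope the change to $/ with local, so the old value comes back automatically when the block ends. The sample data, the in-memory filehandle, and the variable names are all made up for illustration; they are not from your file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample data, read through an in-memory filehandle
my $data = ">id1\nACGT\nTTGA\n>id2\nGGCC\n";
open(my $fh, "<", \$data) or die("Open failed: $!");

my @ids;
{
    # local gives $/ a temporary value; the previous value is
    # restored as soon as this block is left.
    local $/ = ">";
    while (my $chunk = <$fh>) {
        chomp $chunk;            # strips the trailing '>' (the current $/)
        next if $chunk eq '';    # nothing comes before the first '>'
        # Split into at most two parts: the ID line and everything else
        my ($id, $rest) = split /\n/, $chunk, 2;
        push @ids, $id;
        print "ID: $id\n";
    }
}
close($fh);
# Here $/ is back to its default of "\n"
```

Saving $/ in a variable and assigning it back by hand works just as well; local only saves you from forgetting the restore step.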
Here's some sample code to get you started, loading and dealing with only one chunk at a time:
<code>
#!/usr/bin/perl
#
# Parse simple FASTA text, chunk by chunk
use strict;
use warnings;

my $term = $/;
my $fastafile = 'fasta.txt';
my $pos = 0;
my $id;
my @sequencelines;
my $sequence;
my $line;

open(FASTA,"<",$fastafile) or die("Open failed: $!");
$/ = ">";
while(<FASTA>) {
    chomp;
    # Since the file begins with ">", the first extraction will
    # contain only that '>', which will then get chomped, so we'll
    # have a blank line to skip.
    next if($_ eq '');
    ($id,@sequencelines) = split /\n/;
    # I'm not sure if the ID is supposed to include the '>' in front
    # of it or not, but if so, we can put it back.
    $id = '>' . $id;
    print "Found ID '",$id,"' at position ",$pos,":\n";
    $sequence = '';
    foreach $line (@sequencelines) {
        print $line,"\n";
        $sequence .= $line;
    }
    print "\n";
    $pos++;
}
$/ = $term;
</code>
It outputs:
<code>
Found ID '>ELKSMKO02JGD0L' at position 0:
TCAGGAATCTAATACTCAAGCTGTGGCCTATCCAGTACAACATGTAGCGAGACAATAATATCTCAGGATC
TGAATACACCCCTTCTGTTAAAATGCAGTCTAGGATTACACTAGCTTTGTTCACAGCCACGTAACACCAC
TGACTCACATGAAGACTGAAGACAACACAACCCCCCACATCTTGTTCACAAAAACTGGTAGCATGCCAGG
TCTTCCATATCTTTACAGGACACTTGGTATTTTACAAAACTTAATTC

Found ID '>ELKSMKO02FEYZW' at position 1:
TCAGTCATAATGTCATTTCTTCAAAACTTGATCTGTAGATTTAATGGAACCCCAATCAAAATTCCAGCAA
ATTATTAAGTGGATATCCACAAACTGGTTCAAAAGTTTATATGAAATAACAAAAGATCCAGAACAGCCAA
CATAATATTGAAGGAGAAGAATGAAGTTCGAGGTCTAACAAACTAATTTCTGTATGATTCCAACTACATG
GCATTCTGGAAAATGAAAAATTACAGACACAGTAAATAGCTCAGTGATTGCCAGGTAGG

Found ID '>ELKSMKO02IX3A4' at position 2:
TCAGTCCCAACGTGCTGGGAGGGCGTGAGCCACGGTGCCCAGCCTTTTTATTTTTTATTTTTATTTTTAA
TCTGTCTTGATTTTGCTTCCTTCCTAAACAGTTTTGGCTTCGTGATCACGTAAACCAAGAGTCACAAACT
GAAATGCCATCAAGGGGCCAAGCAGGTAACAAAATTCAAGTCATACAGGTTCAATGTCTTAGTCACCCCA
GGCTACAACAGAATATCATAGACTGGGTANCCTAATAATACAGATCATTTTCNCATGGTTCTAGAGGAC
</code>
Note that I kept the output readable by printing each element of the array one at a time, but if you need to deal with the entire sequence (e.g. because you're searching for a sequence containing a particular substring), then $sequence holds the joined lines of the current record and is available until the next pass through the loop resets it. Only one record is held at a time, so you don't have to worry about holding 8GB of data in memory at once.
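If the substring search is what you're after, something along these lines might do as a starting point. It's only a sketch: ids_containing is a hypothetical helper name, the sample data is invented, and I'm assuming a plain literal match with index is enough (no ambiguity codes, no reverse complement).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the IDs of records whose joined
# sequence contains $motif, holding only one record at a time.
sub ids_containing {
    my ($fh, $motif) = @_;
    local $/ = ">";                 # read one '>'-delimited record per pass
    my @hits;
    while (my $chunk = <$fh>) {
        chomp $chunk;
        next if $chunk eq '';       # empty text before the first '>'
        my ($id, $sequence) = split /\n/, $chunk, 2;
        next unless defined $sequence;   # ID with no sequence lines
        $sequence =~ tr/\r\n//d;    # join the lines into one string
        push @hits, $id if index($sequence, $motif) >= 0;
    }
    return @hits;
}

# Invented sample data; "CGTT" straddles the line break in seq1
my $data = ">seq1\nACGTAC\nGTTTGA\n>seq2\nGGGCCC\nAAATTT\n";
open(my $fh, "<", \$data) or die("Open failed: $!");
my @hits = ids_containing($fh, "CGTT");
close($fh);
print "Match: $_\n" for @hits;
```

Joining the lines before searching matters because a match can straddle a line break, which a line-by-line search would miss. Only the matching IDs are kept, so memory use stays at roughly one record plus the hit list.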
That what you need?
In reply to Re: Need to Parse 8 GB File
by AZed
in thread Need to Parse 8 GB File
by ashnator