
Mhm, you forgot to enclose your file in <code></code> tags, but never mind, I can see what it is from the page source.

What you have there is a FASTA file: each record is a ">", the ID, a newline, and then several lines of sequence data. There might be modules for this sort of thing, as dHarry suggested, but from a quick look at them they either want to parse things after they've been broken up, or want to keep huge amounts of data in memory as they parse, so you may be better off just going one record at a time and splitting each one yourself.

Since '>' only happens at the beginning of a new ID/segment, we can break the file up into manageable parts by setting the input record separator ($/) to '>' before we start parsing, and then setting it back after we're done in case we need to read something else line by line. Once the file is broken up into chunks, we have a much simpler problem: separating the first line of each chunk (the ID) into one variable and the rest (the sequence lines) into another.
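To see what changing the separator does in isolation, here's a tiny sketch that reads from a string instead of a file. The two-record sample data and the variable names are made up for illustration, and it uses local so that $/ goes back to its old value on its own when the block ends.

```perl
use strict;
use warnings;

# Hypothetical two-record sample, held in a string instead of a file.
my $data = ">ID1\nACGT\nTTGA\n>ID2\nGGCC\n";

# Opening a reference to a scalar gives an in-memory filehandle.
open(my $fh, '<', \$data) or die("Open failed: $!");

my @chunks;
{
    # local restores the previous value of $/ when this block ends.
    local $/ = ">";
    while (my $chunk = <$fh>)
    {
        chomp $chunk;            # strips the trailing '>', if there is one
        next if $chunk eq '';    # the empty piece before the first '>'
        push @chunks, $chunk;
    }
}
close($fh);

print scalar(@chunks), " chunks read\n";
```

Each element of @chunks is then one ID line followed by that record's sequence lines, newlines included.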

Here's some sample code to get you started, loading and dealing with only one chunk at a time:

#!/usr/bin/perl
#
# Parse simple FASTA text, chunk by chunk

use strict;
use warnings;

my $term = $/;
my $fastafile = 'fasta.txt';

my $pos = 0;
my $id;
my @sequencelines;
my $sequence;
my $line;

open(FASTA,"<",$fastafile) or die("Open failed: $!");

$/ = ">";
while(<FASTA>)
{
    chomp;

    # Since the file begins with ">", the first extraction will
    # contain only that '>', which will then get chomped, so we'll
    # have a blank line to skip.
    next if($_ eq '');

    ($id,@sequencelines) = split /\n/;

    # I'm not sure if the ID is supposed to include the '>' in front
    # of it or not, but if so, we can put it back.
    $id = '>' . $id;

    print "Found ID '",$id,"' at position ",$pos,":\n";

    $sequence = '';
    foreach $line (@sequencelines)
    {
        print $line,"\n";
        $sequence .= $line;
    }
    print "\n";
    $pos++;
}
$/ = $term;
close(FASTA);

It outputs:

Found ID '>ELKSMKO02JGD0L' at position 0:
TCAGGAATCTAATACTCAAGCTGTGGCCTATCCAGTACAACATGTAGCGAGACAATAATATCTCAGGATC
TGAATACACCCCTTCTGTTAAAATGCAGTCTAGGATTACACTAGCTTTGTTCACAGCCACGTAACACCAC
TGACTCACATGAAGACTGAAGACAACACAACCCCCCACATCTTGTTCACAAAAACTGGTAGCATGCCAGG
TCTTCCATATCTTTACAGGACACTTGGTATTTTACAAAACTTAATTC

Found ID '>ELKSMKO02FEYZW' at position 1:
TCAGTCATAATGTCATTTCTTCAAAACTTGATCTGTAGATTTAATGGAACCCCAATCAAAATTCCAGCAA
ATTATTAAGTGGATATCCACAAACTGGTTCAAAAGTTTATATGAAATAACAAAAGATCCAGAACAGCCAA
CATAATATTGAAGGAGAAGAATGAAGTTCGAGGTCTAACAAACTAATTTCTGTATGATTCCAACTACATG
GCATTCTGGAAAATGAAAAATTACAGACACAGTAAATAGCTCAGTGATTGCCAGGTAGG

Found ID '>ELKSMKO02IX3A4' at position 2:
TCAGTCCCAACGTGCTGGGAGGGCGTGAGCCACGGTGCCCAGCCTTTTTATTTTTTATTTTTATTTTTAA
TCTGTCTTGATTTTGCTTCCTTCCTAAACAGTTTTGGCTTCGTGATCACGTAAACCAAGAGTCACAAACT
GAAATGCCATCAAGGGGCCAAGCAGGTAACAAAATTCAAGTCATACAGGTTCAATGTCTTAGTCACCCCA
GGCTACAACAGAATATCATAGACTGGGTANCCTAATAATACAGATCATTTTCNCATGGTTCTAGAGGAC

Note that I kept the output readable by printing each element of the array one at a time, but if you need to deal with the entire sequence (e.g. because you're searching for a sequence containing a particular substring), then $sequence is also available for you to use until the next pass through the loop resets it. Only one chunk is ever held at a time, so you don't have to worry about holding 8GB of data in memory at once.
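If the substring search is what you're after, something along these lines might do as a starting point. It's only a sketch: parse_chunk is a helper name I made up, the sample data and the motif are invented, and the \r? in the split is an assumption in case the file has Windows line endings. The point is that the match is done against the joined sequence, so a motif that straddles a line break is still found.

```perl
use strict;
use warnings;

# Hypothetical helper: split one chunk into its ID and its joined sequence.
sub parse_chunk
{
    my ($chunk) = @_;
    my ($id, @lines) = split /\r?\n/, $chunk;
    return ($id, join('', @lines));
}

# Made-up sample data; the motif spans the line break in the second record.
my $data  = ">ID1\nACGT\nTTGA\n>ID2\nGGCC\nAATT\n";
my $motif = 'CCAA';

open(my $fh, '<', \$data) or die("Open failed: $!");

my @hits;
{
    local $/ = ">";
    while (my $chunk = <$fh>)
    {
        chomp $chunk;
        next if $chunk eq '';
        my ($id, $sequence) = parse_chunk($chunk);
        # index returns -1 when the motif is not present.
        push @hits, $id if index($sequence, $motif) >= 0;
    }
}
close($fh);

print "Matched: @hits\n";
```

Only the IDs of matching records are kept here, so memory use stays small no matter how big the file is.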

That what you need?

In reply to Re: Need to Parse 8 GB File by AZed
in thread Need to Parse 8 GB File by ashnator
