in reply to Bioinformatic task

Define 'big'. One way to handle the problem for smaller values of 'big' is to read the entire file into memory, then use a regular expression to cut it up for you.

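For example, a minimal slurp-and-split sketch; the filename 'seqs.fasta' and the exact split pattern are illustrative (a FASTA header is just a line starting with '>'):

use strict;
use warnings;

# Slurp the whole file in one read ('seqs.fasta' is a placeholder name).
local $/;
open my $fh, '<', 'seqs.fasta' or die "Can't open seqs.fasta: $!";
my $data = <$fh>;
close $fh;

# Split just before each line that starts with '>', keeping each header
# together with its sequence; drop any empty fields for safety.
my @records = grep { length } split /^(?=>)/m, $data;
print "Record: $_\n" for @records;
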
For larger values of 'big' you may be able to use something like:

use strict;
use warnings;

# Set the input record separator so each read returns one ">Seq..." record.
# The first "record" is the leading ">Seq" delimiter itself, which chomps
# away to an empty string and is skipped.
local $/ = ">Seq";

while (<DATA>) {
    chomp;    # strip the trailing ">Seq" delimiter, if any
    next if !length;
    print "Record: $_\n";
}

__DATA__
>Seq1
AAATTTGGG.....
>Seq2
AGATTTACC.....
True laziness is hard work

Re^2: Bioinformatic task
by uvnew (Acolyte) on Nov 07, 2010 at 23:18 UTC
    Thanks for replying. I actually used the word 'big' incorrectly: I must read the whole file into memory, with one structure for the headers and another for the sequences. My computer definitely has enough RAM for that.

      I strongly second what aquarium implies in Re: Bioinformatic task - using a single structure containing records with two parts is much better than trying to keep two parallel arrays in sync.

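      A minimal sketch of that idea (the field names are illustrative): one array of hashes, where each record keeps its header and sequence together:

          use strict;
          use warnings;

          # One record per sequence; nothing to keep in sync by hand.
          my @records = (
              { header => 'Seq1', sequence => 'AAATTTGGG' },
              { header => 'Seq2', sequence => 'AGATTTACC' },
          );

          for my $rec (@records) {
              print "$rec->{header}: $rec->{sequence}\n";
          }
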
      However, it's probably worth your while to tell us more about the problem you are trying to solve. I suspect there are other areas where you could use a little help!

      True laziness is hard work
Re^2: Bioinformatic task
by patcat88 (Deacon) on Nov 09, 2010 at 03:59 UTC
    For the OP's purpose, don't use a regular expression just to cut up a string with a static delimiter. Regular expressions aren't the answer to every parsing problem in Perl: no matter how basic the pattern, they are roughly 10x slower than a couple of lines of index and substr. Use substr and index (see my post at Re: Is Using Threads Slower Than Not Using Threads?), or, as you showed, redefine the input record separator and read "by line".
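    A minimal sketch of the index/substr approach (the sample string is illustrative): scan for each '>' delimiter with index and carve out records with substr, never touching the regex engine:

        use strict;
        use warnings;

        my $fasta = ">Seq1\nAAATTTGGG\n>Seq2\nAGATTTACC\n";

        my @records;
        my $pos = index $fasta, '>';                  # start of the first record
        while ($pos >= 0) {
            my $next = index $fasta, '>', $pos + 1;   # start of the next record
            my $len  = ($next >= 0 ? $next : length $fasta) - $pos;
            push @records, substr $fasta, $pos, $len;
            $pos = $next;
        }

        print "Record: $_\n" for @records;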