I'm writing a script that makes one pass through a very large (up to 3 GB) file, and needs to keep track of the first and last characters in a sliding window of constant small size (e.g. 500).
Walking the filehandle with getc(FH) (actually, two handles - one trailing the other by 500 bytes) is pretty slow. Line by line is a lot faster if I don't try to look at individual characters...but then I've got to index into the line with substr, which slows things down even more. Reading each line into a character array with split is dog slow.
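Here's a rough sketch of that two-handle approach, just so it's clear what I mean (simplified - the real loop does more than look at the two characters; the window size and filename match the benchmark code further down):

use strict;
use warnings;

my $window = 500;

open my $lead,  '<', 'ciona_fasta_two' or die "open: $!";
open my $trail, '<', 'ciona_fasta_two' or die "open: $!";

# advance the lead handle so it runs $window bytes ahead of the trail
getc($lead) for 1 .. $window - 1;

until ( eof($lead) ) {
    my $last_ch  = getc($lead);    # newest character entering the window
    my $first_ch = getc($trail);   # oldest character leaving the window
    # do something with $first_ch / $last_ch
}

close $lead;
close $trail;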
The format of the file I'm hitting allows me to grab very large chunks (10-50 thousand characters each) by setting $/=">", because ">" appears at regular, distant intervals. But indexing into the string captured from the file that way with substr is amazingly slow. Again, splitting the string into an array (where indexing is fast) with split is even worse.
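For reference, this is roughly the chunked version whose substr indexing turned out to be the bottleneck (again simplified; the per-character substr call in the inner loop is the slow part):

use strict;
use warnings;

local $/ = '>';    # records end at each ">", which appear at distant intervals

open my $fh, '<', 'ciona_fasta_two' or die "open: $!";

while ( my $chunk = <$fh> ) {
    chomp $chunk;                             # strip the trailing ">"
    my $len = length $chunk;
    for my $i ( 0 .. $len - 1 ) {
        my $ch = substr( $chunk, $i, 1 );     # this per-character substr is what kills it
        if ( $ch eq 'x' ) {
            # do something
        }
    }
}

close $fh;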
Any suggestions about how to take advantage of the fact that Perl is perfectly fast at reading in large hunks of the file into a string, without having to index into that string with substr?
In short, can you beat the time I get reading a huge file by using:
open (FH, "<ciona_fasta_two");
until (eof(FH)) {
    $ch = getc(FH);
    if ($ch eq 'x') {
        # do something
    }
}
close FH;
(It should be possible to beat it by *a lot*.)

Thanks, Travis