in reply to Re: character-by-character in a huge file
in thread character-by-character in a huge file
A quick aside: as you noted, it's been suggested that I use bioperl to tackle my problem. I use bioperl nearly every day, in many ways... when its tools fit my need. In this case, bioperl offers nothing relevant (the fasta format is trivial, and there's no facility in bioperl to "read character by character, really quickly").
Sadly, the optimizations you've suggested don't work nearly as well for me as they do for you. The first part of the benchmark code (below) deals with this problem. For simplicity, I've removed the bit about sliding window. It's something I need to deal with, but it's not really at the heart of my problem. In this test, I just check to see how long it takes to get access to every character.
The heart of the problem is that I need to read the value of each character in the file one at a time. The "why" is directly related to the sliding window: I can either recalculate the character counts of the whole window each time I slide over one position, or I can just account for the character left behind by the slide and the new character being added. But accessing each individual character of a file is dreadfully slow in perl compared to just slurping the file off disk. This is shown with the second part of the benchmark:
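(To be concrete about the window bookkeeping I mean, here's a rough sketch; the sequence, window width, and variable names are made up for illustration and aren't my real code:)

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: maintain per-character counts over a window of width $w
# without recounting the whole window at each step.
my $seq = "ACGTACGTGGGCCCAT";    # stand-in for one fasta record
my $w   = 5;                     # window width (made-up value)

my %count;
# Prime the counts with the first window.
$count{ substr( $seq, $_, 1 ) }++ for 0 .. $w - 1;

# Slide one position at a time: decrement the character leaving on
# the left edge, increment the character entering on the right edge.
for my $i ( 1 .. length($seq) - $w ) {
    $count{ substr( $seq, $i - 1,      1 ) }--;    # char left behind
    $count{ substr( $seq, $i + $w - 1, 1 ) }++;    # char added
    # %count now holds the composition of substr($seq, $i, $w)
}
```

Either way, every character of the sequence has to pass through my hands exactly once, which is why raw per-character access speed matters so much.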
... (running Red Hat Linux 9.0, perl 5.8.0, testing with a 2MB file)
```perl
#!/usr/bin/perl -sl
use strict;
use Benchmark qw|:all|;

my $ch;
my $filename = "ciona_fasta_two";
$/ = ">";

cmpthese( 10, {
    getc => sub {
        open( FH, "<$filename" );
        until ( eof(FH) ) { $ch = getc(FH); }
        close FH;
    },
    slurp_substr => sub {
        open( FH, "<$filename" );
        while (<FH>) {
            my $i = 0;
            while ( $ch = substr( $_, $i++, 1 ) ) { }
        }
        close FH;
    },
    raw_slurp_substr => sub {
        open( FH, '<:raw', $filename );
        while (<FH>) {
            my $i = 0;
            while ( $ch = substr( $_, $i++, 1 ) ) { }
        }
        close FH;
    },
    slurp_regex => sub {
        open( FH, "<$filename" );
        while (<FH>) {
            while (/(.)/g) { }    # the char is in $1
        }
        close FH;
    },
    raw_sysread_onechar => sub {
        open( FH, '<:raw', $filename );
        while ( sysread( FH, $ch, 1 ) ) { }
        close FH;
    },
    nonraw_sysread_onechar => sub {
        open( FH, "<$filename" );
        while ( sysread( FH, $ch, 1 ) ) { }
        close FH;
    },
    raw_sysread_buffer => sub {
        my ( $read, $buf );
        open( FH, '<:raw', $filename );
        while ( $read = sysread( FH, $buf, 100 * 4096 ) ) {   # faster than 1*4096
            for ( 0 .. $read - 1 ) { $ch = substr( $buf, $_, 1 ); }
        }
        close FH;
    },
    nonraw_sysread_buffer => sub {
        my ( $read, $buf );
        open( FH, "<$filename" );
        while ( $read = sysread( FH, $buf, 100 * 4096 ) ) {
            for ( 0 .. $read - 1 ) { $ch = substr( $buf, $_, 1 ); }
        }
        close FH;
    },
} );

cmpthese( 10, {
    slurp_substr => sub {
        open( FH, "<$filename" );
        while (<FH>) {
            my $i = 0;
            while ( $ch = substr( $_, $i++, 1 ) ) { }
        }
        close FH;
    },
    slurp_simpleregex => sub {
        open( FH, "<$filename" );
        while (<FH>) { /(.)$/; }
        close FH;
    },
    slurp_length => sub {
        my $len = 0;
        open( FH, "<$filename" );
        while (<FH>) { $len += length($_); }
        close FH;
    },
} );
```

-----> RESULTS ----->

```
                       s/iter raw_sysread_onechar nonraw_sysread_onechar raw_slurp_substr getc nonraw_sysread_buffer slurp_regex raw_sysread_buffer slurp_substr
raw_sysread_onechar      2.97     --    -0%    -4%   -19%   -46%   -51%   -53%   -70%
nonraw_sysread_onechar   2.95     0%     --    -3%   -19%   -46%   -51%   -52%   -70%
raw_slurp_substr         2.85     4%     4%     --   -16%   -44%   -49%   -51%   -69%
getc                     2.40    24%    23%    19%     --   -33%   -40%   -41%   -63%
nonraw_sysread_buffer    1.60    86%    85%    79%    50%     --   -10%   -12%   -45%
slurp_regex              1.45   105%   104%    97%    66%    11%     --    -3%   -39%
raw_sysread_buffer       1.41   111%   110%   103%    70%    13%     3%     --   -37%
slurp_substr            0.886   235%   234%   222%   171%    80%    63%    59%     --

                     Rate slurp_substr slurp_length slurp_simpleregex
slurp_substr       1.14/s           --         -96%              -97%
slurp_length       31.2/s        2634%           --              -12%
slurp_simpleregex  35.7/s        3025%          14%                --
```
As you can see from the first test, adding the "raw" option doesn't seem to help at all, and sysread generally slows things down. In general, the best performance I can get is by slurping a chunk of content with the standard $line = <FH> syntax, then indexing into the string using substr.
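Stated on its own, that winning pattern looks roughly like this (a self-contained sketch: an in-memory filehandle and made-up data stand in for my real fasta file):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of the "slurp a record, then substr into it" pattern.
# $data is made up; in the real script this would be the fasta file.
my $data = ">seq1\nACGT\n>seq2\nGGCC\n";
open my $fh, '<', \$data or die "open: $!";
local $/ = ">";    # fasta-style record separator

my $nchars = 0;
while ( my $rec = <$fh> ) {
    my $len = length $rec;
    # A C-style loop avoids the trap in while ($ch = substr(...)),
    # which would stop early on a literal "0" character.
    for ( my $i = 0; $i < $len; $i++ ) {
        my $ch = substr( $rec, $i, 1 );
        $nchars++;    # stand-in for the real per-character work
    }
}
close $fh;
print "$nchars characters\n";    # visits every character of $data
```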
But as you can see from the second benchmark, even the performance of this option is pretty terrible. I can read the entire contents of the file (or even test it with a regex, without assigning individual characters to an accessible variable) in about 1/30th of the time it takes to look at each character in the file. That's much worse than what I've seen (but haven't tested here) in C.
I still hold out hope that there's a faster option, and based on your complete treatment of the topic in your original post, I hope you'll either a) see what I'm doing wrong, or b) come up with another speedup idea.
Thanks again, Travis
Replies:
- Re: Re: character-by-character in a huge file, by BrowserUk (Patriarch) on Apr 13, 2004 at 12:14 UTC
- by mushnik (Acolyte) on Apr 13, 2004 at 15:49 UTC
- by BrowserUk (Patriarch) on Apr 13, 2004 at 16:29 UTC
- by mushnik (Acolyte) on Apr 13, 2004 at 17:14 UTC