Thanks for your thorough consideration of my problem. More detail than I expected to receive... which is a good thing.

A quick aside: as you noted, it's been suggested that I use bioperl to tackle my problem. I use bioperl nearly every day, in all sorts of ways, when its tools fit my needs. In this case, bioperl offers nothing relevant: FASTA format is trivial to parse, and there's no facility in bioperl to "read character by character, really quickly".

Sadly, the optimizations you've suggested don't work nearly as well for me as they do for you. The first part of the benchmark code (below) deals with this problem. For simplicity, I've removed the sliding-window logic. It's something I need to deal with, but it's not really at the heart of my problem. In this test, I just check how long it takes to get access to every character.

The heart of the problem is that I need to read the value of each character in the file one at a time, and accessing each individual character is dreadfully slow in Perl compared with just slurping the file off disk (the second part of the benchmark below shows this). The "why" is directly related to the sliding window: each time the window slides over one position, I can either recalculate the character counts of the entire window, or just account for the one character left behind by the slide and the one character newly added, as in the sketch below.
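For concreteness, here's a minimal sketch of that incremental update. The window size, the sequence, and the variable names are made up for illustration; they're not from my real code:

my $win = 8;                      # illustrative window size
my %count;                        # character counts for the current window
my $seq = "ACGTACGTGGCCAATT";     # stand-in for a real sequence

# prime the counts for the first window
$count{ substr($seq, $_, 1) }++ for 0 .. $win - 1;

# slide one position at a time: decrement the character leaving the
# window, increment the character entering it
for my $i ( 1 .. length($seq) - $win ) {
    $count{ substr($seq, $i - 1, 1) }--;         # left behind by the slide
    $count{ substr($seq, $i + $win - 1, 1) }++;  # newly added
    # ... use %count for the window starting at $i here ...
}

That's two hash updates per slide instead of re-counting all $win characters, which is exactly why I need cheap access to individual characters in the first place.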

Here's the benchmark (running Red Hat Linux 9.0, Perl 5.8.0, testing with a 2 MB file):

#!/usr/bin/perl -sl
use strict;
use Benchmark qw|:all|;

my $ch;
my $filename = "ciona_fasta_two";
$/ = ">";    # read FASTA-delimited records rather than lines

cmpthese( 10, {
    getc => sub {
        open (FH, "<$filename");
        until ( eof(FH) ) { $ch = getc(FH); }
        close FH;
    },
    slurp_substr => sub {
        open (FH, "<$filename");
        my $i = 0;
        while ( <FH> ) {
            while ( $ch = substr($_, $i++, 1) ) { }
        }
        close FH;
    },
    raw_slurp_substr => sub {
        open (FH, '<:raw', "$filename");
        my $i = 0;
        while ( <FH> ) {
            while ( $ch = substr($_, $i++, 1) ) { }
        }
        close FH;
    },
    slurp_regex => sub {
        open (FH, "<$filename");
        while ( <FH> ) {
            while ( /(.)/g ) { }    # the char is in $1
        }
        close FH;
    },
    raw_sysread_onechar => sub {
        open (FH, '<:raw', "$filename");
        while ( sysread( FH, $ch, 1 ) ) { }
        close FH;
    },
    nonraw_sysread_onechar => sub {
        open (FH, '<:raw', "$filename");
        while ( sysread( FH, $ch, 1 ) ) { }
        close FH;
    },
    raw_sysread_buffer => sub {
        my ($read, $buf);
        open (FH, '<:raw', "$filename");
        while ( $read = sysread( FH, $buf, 100*4096 ) ) {   # faster than 1*4096
            for ( 1 .. $read ) {
                $ch = substr( $buf, $_, 1 );
            }
        }
        close FH;
    },
    nonraw_sysread_buffer => sub {
        my ($read, $buf);
        open (FH, "<$filename");
        while ( $read = sysread( FH, $buf, 100*4096 ) ) {
            for ( 1 .. $read ) {
                $ch = substr( $buf, $_, 1 );
            }
        }
        close FH;
    },
} );

cmpthese( 10, {
    slurp_substr => sub {
        open (FH, "<$filename");
        my $i = 0;
        while ( <FH> ) {
            while ( $ch = substr($_, $i++, 1) ) { }
        }
        close FH;
    },
    slurp_simpleregex => sub {
        my $len = 0;
        open (FH, "<$filename");
        while ( <FH> ) { $_ =~ /(.)$/; }
        close FH;
    },
    slurp_length => sub {
        my $len = 0;
        open (FH, "<$filename");
        while ( <FH> ) { $len += length($_); }
        close FH;
    },
} );

-----> RESULTS ----->

                       s/iter raw_sysread_onechar nonraw_sysread_onechar raw_slurp_substr  getc nonraw_sysread_buffer slurp_regex raw_sysread_buffer slurp_substr
raw_sysread_onechar      2.97                  --                    -0%              -4%  -19%                  -46%        -51%               -53%         -70%
nonraw_sysread_onechar   2.95                  0%                     --              -3%  -19%                  -46%        -51%               -52%         -70%
raw_slurp_substr         2.85                  4%                     4%               --  -16%                  -44%        -49%               -51%         -69%
getc                     2.40                 24%                    23%              19%    --                  -33%        -40%               -41%         -63%
nonraw_sysread_buffer    1.60                 86%                    85%              79%   50%                    --        -10%               -12%         -45%
slurp_regex              1.45                105%                   104%              97%   66%                   11%          --                -3%         -39%
raw_sysread_buffer       1.41                111%                   110%             103%   70%                   13%          3%                --          -37%
slurp_substr            0.886                235%                   234%             222%  171%                   80%         63%               59%           --

                      Rate slurp_substr slurp_length slurp_simpleregex
slurp_substr        1.14/s           --         -96%              -97%
slurp_length        31.2/s        2634%           --              -12%
slurp_simpleregex   35.7/s        3025%          14%                --

As you can see from the first test, adding the ':raw' layer doesn't help, and sysread doesn't seem to help either; reading one character per sysread is the slowest option of all. In general, the best performance I can get is by slurping a chunk of content with the standard $line = <FH> syntax, then indexing into the string using substr.
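Distilled down, that approach looks something like the sketch below. The lexical filehandle, error check, and per-record index are tidied up for illustration and differ slightly from the benchmark code:

my $filename = "ciona_fasta_two";   # same test file as the benchmark
local $/ = ">";                     # FASTA-style records, as above
open my $fh, '<', $filename or die "can't open $filename: $!";
while ( my $rec = <$fh> ) {
    for my $i ( 0 .. length($rec) - 1 ) {
        my $ch = substr( $rec, $i, 1 );
        # ... examine $ch here (e.g. feed it to the window counts) ...
    }
}
close $fh;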

But as you can see from the second benchmark, even this best option is actually pretty terrible. I can read the entire contents of the file (or even run a regex against each record, without assigning individual characters to an accessible variable) in roughly 1/30th of the time it takes to look at each character in the file. That's a much worse ratio than what I've seen (but haven't tested here) in C.

I still hold out hope that there's a faster option. Based on your thorough treatment of the topic in your original post, I hope you'll either a) see what I'm doing wrong, or b) come up with another speedup idea.

Thanks again, Travis


