in reply to character-by-character in a huge file

Please see Optimising processing for large data files for some techniques for reducing your runtimes by 96%, if your application doesn't lend itself to using the Bioperl libraries.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail

Re: character-by-character in a huge file
by mushnik (Acolyte) on Apr 13, 2004 at 05:59 UTC
    Thanks for your thorough consideration of my problem. More detail than I expected to receive...which is a good thing.

    A quick aside: as you noted, it's been suggested that I use bioperl to tackle my problem. I use bioperl nearly every day in many different ways...when its tools fit my needs. In this case, bioperl offers nothing relevant (fasta format is trivial, and there's no facility in bioperl to "read character by character, really quickly").

    Sadly, the optimizations you've suggested don't work nearly as well for me as they do for you. The first part of the benchmark code (below) deals with this problem. For simplicity, I've removed the bit about sliding window. It's something I need to deal with, but it's not really at the heart of my problem. In this test, I just check to see how long it takes to get access to every character.

    The heart of the problem is that I need to read the value of each character in the file one-at-a-time (the "why" is directly related to the sliding window: I can either recalculate the character counts of the window each time I slide over one space, or I can just consider the character being left behind by the slide, and the new character being added). But accessing each individual character from a file is dreadfully slower in perl than just slurping the file off disk. This is shown with the second part of the benchmark:
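    The incremental update described above can be sketched as follows (a minimal illustration with a toy sequence and a window size of 4; the data and variable names are made up, not from the post):

```perl
use strict;
use warnings;

my $seq = "ACGTACGGT";
my $win = 4;

my %count;
# Prime the counts with the first window.
$count{ substr( $seq, $_, 1 ) }++ for 0 .. $win - 1;

# Slide one position at a time: decrement the character leaving the
# window, increment the character entering it. No recount needed.
for my $i ( 1 .. length($seq) - $win ) {
    $count{ substr( $seq, $i - 1,        1 ) }--;
    $count{ substr( $seq, $i + $win - 1, 1 ) }++;
    # %count now holds the character counts for the window starting at $i
}
```

    Each slide touches exactly two characters regardless of window size, which is why per-character access speed dominates the runtime.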

    ... (running Redhat Linux 9.0. perl 5.8.0, testing with a 2MB file)

    #!/usr/bin/perl -sl
    use strict;
    use Benchmark qw|:all|;

    my $ch;
    my $filename = "ciona_fasta_two";
    $/ = ">";

    cmpthese( 10, {
        getc => sub {
            open( FH, "<$filename" );
            until ( eof(FH) ) { $ch = getc(FH); }
            close FH;
        },
        slurp_substr => sub {
            open( FH, "<$filename" );
            my $i = 0;
            while (<FH>) {
                while ( $ch = substr( $_, $i++, 1 ) ) { }
            }
            close FH;
        },
        raw_slurp_substr => sub {
            open( FH, '<:raw', "$filename" );
            my $i = 0;
            while (<FH>) {
                while ( $ch = substr( $_, $i++, 1 ) ) { }
            }
            close FH;
        },
        slurp_regex => sub {
            open( FH, "<$filename" );
            while (<FH>) {
                while (/(.)/g) { }    # the char is in $1
            }
            close FH;
        },
        raw_sysread_onechar => sub {
            open( FH, '<:raw', "$filename" );
            while ( sysread( FH, $ch, 1 ) ) { }
            close FH;
        },
        nonraw_sysread_onechar => sub {
            open( FH, '<:raw', "$filename" );
            while ( sysread( FH, $ch, 1 ) ) { }
            close FH;
        },
        raw_sysread_buffer => sub {
            my ( $read, $buf );
            open( FH, '<:raw', "$filename" );
            while ( $read = sysread( FH, $buf, 100 * 4096 ) ) {    # faster than 1*4096
                for ( 1 .. $read ) { $ch = substr( $buf, $_, 1 ); }
            }
            close FH;
        },
        nonraw_sysread_buffer => sub {
            my ( $read, $buf );
            open( FH, "<$filename" );
            while ( $read = sysread( FH, $buf, 100 * 4096 ) ) {
                for ( 1 .. $read ) { $ch = substr( $buf, $_, 1 ); }
            }
            close FH;
        },
    } );

    cmpthese( 10, {
        slurp_substr => sub {
            open( FH, "<$filename" );
            my $i = 0;
            while (<FH>) {
                while ( $ch = substr( $_, $i++, 1 ) ) { }
            }
            close FH;
        },
        slurp_simpleregex => sub {
            my $len = 0;
            open( FH, "<$filename" );
            while (<FH>) { $_ =~ /(.)$/; }
            close FH;
        },
        slurp_length => sub {
            my $len = 0;
            open( FH, "<$filename" );
            while (<FH>) { $len += length($_); }
            close FH;
        },
    } );

    -----> RESULTS ----->

                            s/iter raw_sysread_onechar nonraw_sysread_onechar raw_slurp_substr getc nonraw_sysread_buffer slurp_regex raw_sysread_buffer slurp_substr
    raw_sysread_onechar       2.97                  --                    -0%              -4% -19%                  -46%         -51%               -53%         -70%
    nonraw_sysread_onechar    2.95                  0%                     --              -3% -19%                  -46%         -51%               -52%         -70%
    raw_slurp_substr          2.85                  4%                     4%               -- -16%                  -44%         -49%               -51%         -69%
    getc                      2.40                 24%                    23%              19%   --                  -33%         -40%               -41%         -63%
    nonraw_sysread_buffer     1.60                 86%                    85%              79%  50%                    --         -10%               -12%         -45%
    slurp_regex               1.45                105%                   104%              97%  66%                   11%           --                -3%         -39%
    raw_sysread_buffer        1.41                111%                   110%             103%  70%                   13%           3%                --          -37%
    slurp_substr             0.886                235%                   234%             222% 171%                   80%          63%               59%           --

                         Rate slurp_substr slurp_length slurp_simpleregex
    slurp_substr       1.14/s           --         -96%              -97%
    slurp_length       31.2/s        2634%           --              -12%
    slurp_simpleregex  35.7/s        3025%          14%                --

    As you can see from the first test, adding the "raw" option doesn't help, and sysread generally slows things down rather than helping. In general, the best performance I can get is by slurping a bunch of content with the standard $line=<FH> syntax, then indexing into the string using substr.

    But as you can see from the 2nd benchmark, the performance of this option is actually pretty terrible. I can read in the entire contents of the file (or even test with a regex, without assigning individual characters to an accessible variable) in 1/30th the time it takes to look at each character in the file. That's much worse than what I've seen (but haven't tested here) in C.

    I still hold out hope that there's a faster option - and based on your complete treatment of the topic in your original post, hope you'll either a) see what I'm doing wrong, or b) come up with another speedup idea.

    Thanks again, Travis

      a) see what I'm doing wrong,

      The first thing you are doing wrong is that you are comparing apples and oranges. Take your 2nd benchmark.

      cmpthese( 10, {
          slurp_substr => sub {
              open( FH, "<$filename" );
              my $i = 0;
              while (<FH>) {
                  while ( $ch = substr( $_, $i++, 1 ) ) { }
              }
              close FH;
          },
          slurp_simpleregex => sub {
              my $len = 0;
              open( FH, "<$filename" );
              while (<FH>) { $_ =~ /(.)$/; }
              close FH;
          },
          slurp_length => sub {
              my $len = 0;
              open( FH, "<$filename" );
              while (<FH>) { $len += length($_); }
              close FH;
          },
      } );
      1. Slurp_substr()

        This reads the whole file, record-by-record, and then appears to set the (global) variable $ch to each character in each record.

        But you're setting the variable $i outside the main loop, incrementing it for each character in the record, and never resetting it.

        Hence, for the second and subsequent records, $i will have whatever value it had at the end of the previous record. If the first record is longer than the rest, it will do nothing for any record other than the first.

        Both slurp_substr() and raw_slurp_substr() routines in the 1st benchmark are similarly afflicted.

      2. slurp_simpleregex()

        Your regex says put the last character of each record into $1. You're simply ignoring every character except the last in each record.

      3. slurp_length()

        This is the most mysterious of all. You read each record in the file and accumulate the lengths of those records into the variable $len.

        You never access any of the characters?

      The first rule of benchmarking is that you need to ensure that you are comparing like with like.

      The second rule is that you need to make sure that what you are benchmarking is useful and relevant to your final program.

      In these tests, you are doing neither.
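      The $i bug in slurp_substr (point 1 above) is fixed by resetting $i for every record rather than once per file. A minimal self-contained sketch of the corrected loop (demo.fa here is a made-up stand-in for the benchmark's $filename):

```perl
use strict;
use warnings;

# Create a tiny fasta-like demo file; 22 characters in total.
open my $out, '>', 'demo.fa' or die $!;
print $out ">seq1\nACGT\n>seq2\nGGCC\n";
close $out;

local $/ = ">";    # record separator, as in the benchmark

my $chars = 0;
my $ch;
open my $fh, '<', 'demo.fa' or die $!;
while (<$fh>) {
    my $i = 0;     # the fix: reset for every record
    while ( $ch = substr( $_, $i++, 1 ) ) { $chars++; }
}
close $fh;
print "$chars\n";  # prints 22
unlink 'demo.fa';
```

      Note, too, that the inner while tests substr's result for truth, so a literal "0" character would end a record's scan early; that's harmless for sequence data, but worth keeping in mind.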


      That's much worse than what I've seen (but haven't tested here) in C.

      If your point is that Perl is slower than C, you're right.


        The "failing to reset $i" bug is duely noted. The impact of moving "my $i=0" inside the "while (<FH>)" line is simply that each of those cases slows down, leaving me with the results:

                               s/iter raw_slurp_substr nonraw_sysread_onechar raw_sysread_onechar getc slurp_substr nonraw_sysread_buffer slurp_regex raw_sysread_buffer
        raw_slurp_substr         3.68               --                   -16%                -17% -34%         -53%                  -57%        -60%               -62%
        nonraw_sysread_onechar   3.08              19%                     --                 -1% -22%         -45%                  -49%        -53%               -54%
        raw_sysread_onechar      3.06              20%                     1%                  -- -21%         -44%                  -48%        -52%               -54%
        getc                     2.42              52%                    27%                 26%   --         -29%                  -35%        -40%               -42%
        slurp_substr             1.71             115%                    80%                 79%  41%           --                   -8%        -15%               -18%
        nonraw_sysread_buffer    1.58             133%                    95%                 93%  53%           8%                   --          -8%               -11%
        slurp_regex              1.46             152%                   111%                110%  66%          17%                   8%           --                -4%
        raw_sysread_buffer       1.41             161%                   119%                117%  72%          22%                  12%           4%                --

                              Rate raw_sysread_buffer slurp_length slurp_simpleregex
        raw_sysread_buffer 0.706/s                 --         -98%              -98%
        slurp_length        31.2/s              4328%           --               -9%
        slurp_simpleregex   34.5/s              4786%          10%                --

        But I'm not sure how this is apples and oranges. There are two benchmark tests here:

        In the first, after the bug fix, the raw/sysread buffer approach works best, but it's only about 10% better than just slurping in the contents with <FH> (and the raw/sysread onechar approach is actually worse than getc). In general, your final result shows an improvement, but it's not as dramatic as I'd hoped. Perhaps this is a function of the OS in use: such dramatic differences between your results and mine suggest that you may be using Windows (I'm on Linux), and that getc may be really terrible on Windows - is that right? I'd be interested in seeing the results you get when you run the same benchmark (after fixing the $i bug you mention).

        In the second benchmark, my point is a bit more interesting (to me) than simply saying that Perl is slower than C. I've reposted, comparing my two "fast" approaches to raw_sysread_buffer. The point of &slurp_length and &slurp_simpleregex is to show that the thing that makes &raw_sysread_buffer (and the others) so remarkably slow is not the actual act of reading from disk, but the act of accessing the values one at a time. For example, the regex test simply aims to show that I must have read the entire block from disk (I got the last character). In these tests, I'm not meaning to say that Perl is slower than C, I'm saying that Perl (as I'm using it) is unbearably slower than expected (by me).
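        One way to see the per-character cost in isolation is to compare a Perl-level character loop against a single whole-buffer operation such as tr///, which does its scanning in C. This is an illustrative sketch only (the buffer contents are made up, and tr/// is my suggestion, not something from the thread):

```perl
use strict;
use warnings;

my $buf = "ACGT" x 10_000;    # 40,000 characters, 10,000 of them 'G'

# Per-character access: one Perl-level substr call per character.
my $slow = 0;
for my $i ( 0 .. length($buf) - 1 ) {
    $slow++ if substr( $buf, $i, 1 ) eq 'G';
}

# Whole-buffer access: tr/// counts matches in one pass.
my $fast = $buf =~ tr/G//;

print "$slow $fast\n";    # prints "10000 10000"
```

        Both produce the same count; only the access pattern differs, and the tr/// pass avoids the per-character loop entirely. For the sliding-window counts it could at least prime the initial window, though the incremental enter/leave update remains the cheaper strategy per slide.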

        This amazing slowness in this one application is surprising to me, because I've generally found Perl to be pretty darned fast. This has been especially true in dealing with text (i.e. regex).


        A couple small notes:

        I have no interest in writing this in C. Perl is my preferred language, and it was my intention to show the C-lovers I work with that Perl is a perfectly good tool for this sort of task. I'm having a (much) harder time proving that than I'd hoped I would. Perhaps I'm wrong :(

        I also have no intention of flaming you with my response. It's clear to me that you've taken a good deal of time to think about my problem, and I'm most appreciative of that time. The intention of my response is simply to show that the benefits I'm seeing don't match your expectations, and to see if you can suggest another approach.