(tye)Re: Why use <$fh> at all?

Benchmarking is a complex thing. See podmaster's node showing that the <> version is much faster for him and that a simple change makes it faster still. More on this later. (Oh, and your code is badly broken.)

I stand by a previous statement of mine: I consider perl to be broken if it can't internally implement "read a block at a time and split into lines" faster than you can by writing Perl code. After all, if it can't, we'd be better off replacing the <> implementation with some external module implemented purely in Perl (which could then be converted to C since it all ends up in C anyway, and then optimized, etc.).

But the fact is that Perl is broken. Perl went out of its way to make <> fast. It did so by doing some "interesting" tricks which meant that, in Perl, <> was sometimes faster than fgets() in C. The problem with this was that these tricks didn't work so well in all cases.

It is a bit like what was written in the original node. Now you've got some complex, hard-to-maintain code that is a bit faster than some very simple, portable code. When things change (like Linux or PerlIO), suddenly you end up with big, complex code that is also slow. This describes what happened to Perl and it also describes what some are trying to do to "fix" it.

If you need lines from a file, then use <>. If that ends up being uncomfortably slow, then you might want to look into doing it a different way, like our opening example. But don't "optimize" before you need to.
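For reference, the plain approach is only a few lines. This is a minimal sketch, not code from the original thread: the sample data is made up, and an in-memory filehandle stands in for a real file so the snippet runs on its own.

```perl
use strict;
use warnings;

# Hypothetical sample data; the in-memory filehandle stands in for a real file.
my $data = "alpha\nbeta\n\ngamma\n";
open my $fh, '<', \$data or die "Can't open in-memory file: $!";

my @lines;
while ( my $line = <$fh> ) {
    push @lines, $line;    # each line keeps its trailing "\n"
}
close $fh;

print scalar(@lines), " lines read\n";
```

Note that the empty third line comes back as "\n", so any replacement for <> has to hand it back as well.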

And here are my results for the benchmarks. I noticed that you cheated by letting your big code use binmode (which saves the C libraries some translation work on some platforms), which also means that your replacement code doesn't even give the same results. So I fixed that (which, on my platform, makes about a 20% difference in speed).

Then I checked for other bugs in your code. And this is why you don't get so obsessed about speed! Great, you have code that you think is a lot faster, but I've already found two bugs in it (make that 3, if there is no final newline). Put a lot more effort into getting the code correct and a lot less worry into how fast it runs.
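The "no final newline" case is easy to see in isolation. The following is a sketch under my own assumptions: block_lines, its $flush flag, the 4-byte block size, and the sample text are all hypothetical, chosen only to make the effect visible on a tiny input.

```perl
use strict;
use warnings;

# Hypothetical helper: split the in-memory text $data into lines by reading
# $size bytes at a time. $flush says whether leftover text is kept at EOF.
sub block_lines {
    my( $data, $size, $flush )= @_;
    open my $fh, '<', \$data or die "Can't open in-memory file: $!";
    my @lines;
    my $block= '';
    # Append each new block to whatever partial line is left in $block.
    while( read $fh, $block, $size, length $block ) {
        my $j= 0;
        # index returns -1 when no newline is left, so $i becomes 0 (false).
        while( my $i= 1 + index $block, "\n", $j ) {
            push @lines, substr $block, $j, $i-$j;
            $j= $i;
        }
        substr( $block, 0, $j )= '';    # drop the lines already pushed
    }
    # Without this, a last line that lacks "\n" is silently dropped.
    push @lines, $block   if  $flush && length $block;
    return @lines;
}

my $text= "one\ntwo\nthree";    # note: no final newline
my @lost= block_lines( $text, 4, 0 );
my @kept= block_lines( $text, 4, 1 );
print scalar(@lost), " vs ", scalar(@kept), " lines\n";    # prints "2 vs 3 lines"
```

Without the flush, "three" never makes it into the result.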

So I rewrote your block-at-a-time code because I thought I saw some places where I could make it faster. (:

And this is when I found the fourth bug! And this was a big one, that completely invalidates the speed tests for the input file I was using.

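The reason your block-at-a-time code is so much faster is probably that it says $i > 0 instead of $i >= 0. index returns 0 when the block starts with a newline, so the inner loop stalls at the first empty line and the code manages to read only a fraction of the total number of lines.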
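Here is a small sketch of that failure. It is not the benchmark code itself: count_lines and its $min parameter are hypothetical names, the block size is shrunk to 8 bytes, and the input is made up so the effect shows on a few lines.

```perl
use strict;
use warnings;

# Hypothetical helper: the same loop shape as the block-at-a-time code, with
# the comparison passed in as $min (1 acts like "$i > 0", 0 like "$i >= 0").
sub count_lines {
    my( $data, $min )= @_;
    open my $fh, '<', \$data or die "Can't open in-memory file: $!";
    my @lines;
    my $block;
    my $left= '';
    while( read $fh, $block, 8 ) {
        $block= $left . $block;
        my $i= index $block, "\n";
        # An empty line puts "\n" at position 0, so index returns 0 here.
        while( $i >= $min ) {
            push @lines, substr( $block, 0, $i+1 );
            substr $block, 0, $i+1, '';
            $i= index $block, "\n";
        }
        $left= $block;
    }
    return scalar @lines;
}

my $text= "a\nb\n\nc\nd\ne\n";    # the third line is empty
print count_lines( $text, 1 ), " vs ", count_lines( $text, 0 ), "\n";    # prints "2 vs 6"
```

With the strict comparison, everything from the first empty line on just piles up in $left and is never pushed.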

Sorry, I have to run now.

- tye (but my friends call me join"",'T','y','e')

#!/usr/bin/perl -w

use strict;
use Benchmark qw( cmpthese );

sub blockold {
    my $fh;
    open $fh, '<', 'vsfull.csv' or die "Can't read vsfull.csv: $!";
    #binmode $fh;
    my @lines;
    my $block;
    my $left= '';
    while(  read $fh, $block, 8192  ) {
        $block = $left . $block;
        my $i = index $block, "\n";
        while($i > 0){
            push @lines, substr($block,0,$i);
            substr $block, 0, $i+1, '';
            $i = index $block, "\n";
        }
        $left = $block;
    }
    return @lines;
}

sub blockfix {
    my $fh;
    open $fh, '<', 'vsfull.csv' or die "Can't read vsfull.csv: $!";
    my @lines;
    my $block;
    my $left= '';
    while(  read $fh, $block, 8192  ) {
        $block = $left . $block;
        my $i = index $block, "\n";
        while(  $i >= 0  ) {
            push @lines, substr($block,0,$i+1);
            substr $block, 0, $i+1, '';
            $i = index $block, "\n";
        }
        $left = $block;
    }
    # Keep a last line that has no trailing newline.
    push @lines, $left   if  length $left;
    return @lines;
}

sub blocknew {
    my $fh;
    open $fh, '<', 'vsfull.csv' or die "Can't read vsfull.csv: $!";
    my @lines;
    my $block= '';
    while( read $fh, $block, 8192, length $block  ) {
        my $i;
        my $j= 0;
        while(  $i= 1 + index $block, "\n", $j  ) {
            push @lines, substr $block, $j, $i-$j;
            $j= $i;
        }
        substr( $block, 0, $j )= '';
    }
    # Keep a last line that has no trailing newline.
    push @lines, $block   if  length $block;
    return @lines;
}

sub splitnew {
    my $fh;
    open $fh, '<', 'vsfull.csv' or die "Can't read vsfull.csv: $!";
    my @lines;
    my $block= '';
    while( read $fh, $block, 8192, length $block  ) {
        push @lines, split /(?<=\n)/, $block, -1;
        $block= pop @lines;
    }
    push @lines, $block   if  length $block;
    return @lines;
}

sub lineold {
    my $fh;
    my @lines;
    open $fh, '<', 'vsfull.csv' or die "Can't read vsfull.csv: $!";
    while(<$fh>){
        push @lines, $_;
    }
    return @lines;
}

sub linenew {
    my $fh;
    my @lines;
    open $fh, '<', 'vsfull.csv' or die "Can't read vsfull.csv: $!";
    @lines= <$fh>;
    return @lines;
}

my @bo= blockold();
my @bf= blockfix();
my @bn= blocknew();
my @lo= lineold();
my @ln= linenew();
my @sn= splitnew();
warn "blockold is broken!\n"   if  "@bo" ne "@ln";
warn "blockfix is broken!\n"   if  "@bf" ne "@ln";
warn "blocknew is broken!\n"   if  "@bn" ne "@ln";
warn "lineold is broken!\n"    if  "@lo" ne "@ln";
warn "splitnew is broken!\n"   if  "@sn" ne "@ln";

cmpthese( -3, {
    lo => \&lineold,
    ln => \&linenew,
    bo => \&blockold,
    bn => \&blocknew,
    bf => \&blockfix,
    sn => \&splitnew,
});