Re^2: Matching data in a big file

Lately, I've just been using:

$document = join("", <RTF_FILE>);

Compared to the localize technique, I find it clear and easy to type. It is, however, very inefficient. Still, it's fast enough that until today I had never noticed the time it takes to load the file. Will I continue to use the join version? Certainly, except for tasks where I need to slurp enough files for the difference to become significant.
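For anyone who hasn't seen the localize technique, here is a minimal sketch of it next to the join version. The helper name slurp_file and the temporary test file are made up for illustration; they are not from the thread. The join version reads the file in list context, building one list element per line before gluing them back together, while the localized version undefines $/ so that a single read returns the whole file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (name is an assumption): slurp a file via a localized $/.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open '$path': $!";
    # With $/ undefined, <$fh> in scalar context returns the rest of the file
    # as one string. 'local' restores the old value when the do block exits.
    my $content = do { local $/; <$fh> };
    close $fh;
    return $content;
}

# Self-contained demo: write a small temporary file, then read it both ways.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "line one\nline two\n";
close $tmp_fh or die "Cannot close '$tmp_name': $!";

my $via_local = slurp_file($tmp_name);

open my $fh, '<', $tmp_name or die "Cannot open '$tmp_name': $!";
my $via_join = join '', <$fh>;    # list context: one element per line, then joined
close $fh;

print $via_local eq $via_join ? "identical\n" : "different\n";
```

Both approaches produce the same string; the only difference is how much intermediate work Perl does to get there.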

Just because I have it, here's the benchmark code and results. (You don't have to open it: File::Slurp and the local technique are roughly equivalent, and both are about 10 times faster than the join version.)

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(:all);
use Data::Dumper;
use File::Slurp;

my $FName = shift // 'TESTDATA';
my $iterations = shift // 500;

# Both for error checking *AND* warming up the disk cache
my $expected = length(slurp_local($FName));

my %cnt;

sub slurp_local {
    my $Name = shift;
    open my $FH, '<', $Name or die "Cannot open '$Name': $!";
    local $/;
    my $str = <$FH>;
    return $str;
}

sub slurp_join {
    my $Name = shift;
    open my $FH, '<', $Name or die "Cannot open '$Name': $!";
    my $str = join("",<$FH>);
    return $str;
}

sub slurp_for {
    my $Name = shift;
    open my $FH, '<', $Name or die "Cannot open '$Name': $!";
    my $str = '';
    $str .= $_ for <$FH>;
    return $str;
}

print "Reading file '$FName' $iterations times with each routine.\n";

cmpthese($iterations, {
        slurp_local =>      sub { $cnt{slurp_local}{length(slurp_local($FName))}++; },
        slurp_join =>       sub { $cnt{slurp_join}{length(slurp_join($FName))}++; },
        slurp_for =>        sub { $cnt{slurp_for}{length(slurp_for($FName))}++; },
        slurp_FS =>         sub { $cnt{slurp_FS}{length(read_file($FName))}++; },
});

for my $k (keys %cnt) {
    for my $v (keys %{$cnt{$k}}) {
        my $msg = '';
        $msg = 'ERROR! wrong length!' if $v != $expected;
        printf "% 8u %s %s\n", $v, $k, $msg;
    }
}

The results are pretty consistent across file sizes for largish files: File::Slurp and the localize version are about 10 times faster on my current machine. For smaller files the gap narrows (roughly 3x for the 3,391-byte file, and within about 30% for the 198-byte file), but the elapsed times there are too small for me to care about. Note: I didn't check whether the module load time for File::Slurp affected the results.

$ perl slurp_bench.pl TD3 30
Reading file 'TD3' 30 times with each routine.
            s/iter   slurp_for  slurp_join slurp_local    slurp_FS
slurp_for     4.81          --         -8%        -90%        -92%
slurp_join    4.41          9%          --        -89%        -91%
slurp_local  0.485        890%        808%          --        -21%
slurp_FS     0.385       1150%       1045%         26%          --
99299497 slurp_local 
99299497 slurp_join 
99299497 slurp_for 
99299497 slurp_FS 

$ perl slurp_bench.pl TD2 300
Reading file 'TD2' 300 times with each routine.
              Rate   slurp_for  slurp_join slurp_local    slurp_FS
slurp_for   2.30/s          --        -11%        -91%        -91%
slurp_join  2.59/s         13%          --        -89%        -90%
slurp_local 24.5/s        968%        849%          --         -9%
slurp_FS    26.9/s       1071%        940%         10%          --
 9027227 slurp_local 
 9027227 slurp_join 
 9027227 slurp_for 
 9027227 slurp_FS 

$ perl slurp_bench.pl TESTDATA 500
Reading file 'TESTDATA' 500 times with each routine.
              Rate   slurp_for  slurp_join slurp_local    slurp_FS
slurp_for   25.8/s          --         -9%        -90%        -91%
slurp_join  28.3/s         10%          --        -89%        -90%
slurp_local  248/s        860%        775%          --        -12%
slurp_FS     281/s        990%        893%         13%          --
  820657 slurp_local 
  820657 slurp_join 
  820657 slurp_for 
  820657 slurp_FS 

Reading file 'TDSmall' 10000 times with each routine.
               Rate   slurp_for  slurp_join    slurp_FS slurp_local
slurp_for    3774/s          --        -12%        -67%        -69%
slurp_join   4292/s         14%          --        -62%        -64%
slurp_FS    11364/s        201%        165%          --         -6%
slurp_local 12048/s        219%        181%          6%          --
    3391 slurp_local 
    3391 slurp_join 
    3391 slurp_for 
    3391 slurp_FS 

Reading file 'TDTiny' 25000 times with each routine.
               Rate    slurp_FS   slurp_for  slurp_join slurp_local
slurp_FS    10504/s          --         -3%         -8%        -24%
slurp_for   10776/s          3%          --         -6%        -22%
slurp_join  11416/s          9%          6%          --        -17%
slurp_local 13736/s         31%         27%         20%          --
     198 slurp_local 
     198 slurp_join 
     198 slurp_for 
     198 slurp_FS

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Re^3: Matching data in a big file by hdb (Monsignor) on Dec 14, 2013 at 18:25 UTC
Thanks roboticus for your useful comparison.