
hdb:

Lately, I've just been using:

$document = join("", <RTF_FILE>);

Compared to the localize technique (local $/), I find it clear and easy to type. It is, however, very inefficient. It's fast enough, though, that until today I had never noticed the time it takes to load the file. Will I continue to use the join version? Certainly, except for tasks where I need to slurp enough data that the difference becomes significant.
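For anyone who hasn't seen the two side by side, here is a minimal sketch. The helper names slurp_file and slurp_file_join are made up for illustration, and I'm using a lexical filehandle on a scratch file rather than the RTF_FILE bareword above. The join version is slower because <$fh> in list context first splits the file into a list of lines, which join then has to glue back together; with $/ undefined, a single read returns the whole file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: slurp a file with the "local $/" idiom.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open '$path': $!";
    local $/;               # undefine the record separator for this scope only
    my $content = <$fh>;    # a single read now returns the entire file
    close $fh;
    return $content;
}

# Hypothetical helper: the join version, for comparison.
sub slurp_file_join {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open '$path': $!";
    # List context: the file is split into lines, then concatenated again.
    return join '', <$fh>;
}

# Usage: write a small scratch file, then read it back both ways.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} "first line\nsecond line\n";
close $tmp_fh or die "Cannot close '$tmp_path': $!";

my $whole  = slurp_file($tmp_path);
my $joined = slurp_file_join($tmp_path);
print "Both techniques return the same string\n" if $whole eq $joined;
```

Because local restores $/ when the sub returns, the rest of the program keeps reading line by line as usual.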

Since I have it handy, here are the benchmark code and results. (You don't have to read through them: File::Slurp and the local technique are roughly equivalent, and both are about 10 times faster than the join version.)

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(:all);
use Data::Dumper;
use File::Slurp;

my $FName = shift // 'TESTDATA';
my $iterations = shift // 500;

# Both for error checking *AND* warming up the disk cache
my $expected = length(slurp_local($FName));
my %cnt;

sub slurp_local {
    my $Name = shift;
    open my $FH, '<', $Name;
    local $/;
    my $str = <$FH>;
    return $str;
}

sub slurp_join {
    my $Name = shift;
    open my $FH, '<', $Name;
    my $str = join("",<$FH>);
    return $str;
}

sub slurp_for {
    my $Name = shift;
    open my $FH, '<', $Name;
    my $str = '';
    $str .= $_ for <$FH>;
    return $str;
}

print "Reading file '$FName' $iterations times with each routine.\n";
cmpthese($iterations, {
    slurp_local => sub { $cnt{slurp_local}{length(slurp_local($FName))}++; },
    slurp_join  => sub { $cnt{slurp_join}{length(slurp_join($FName))}++; },
    slurp_for   => sub { $cnt{slurp_for}{length(slurp_for($FName))}++; },
    slurp_FS    => sub { $cnt{slurp_FS}{length(read_file($FName))}++; },
});

for my $k (keys %cnt) {
    for my $v (keys %{$cnt{$k}}) {
        my $msg = '';
        $msg = 'ERROR! wrong length!' if $v != $expected;
        printf "% 8u %s %s\n", $v, $k, $msg;
    }
}

The results show that the ratio is pretty consistent across the largish files (roughly 800 KB to 99 MB): File::Slurp and the localize version are about 10 times faster than the join version on my current machine. For smaller files the gap narrows. At about 3 KB both are a bit under 3 times faster, and at 198 bytes the localize version is only 20% faster while File::Slurp is actually the slowest. At those sizes, though, the elapsed times are too small for me to care about. Note: I didn't bother checking whether the module load time for File::Slurp affected the results.

$ perl slurp_bench.pl TD3 30
Reading file 'TD3' 30 times with each routine.
            s/iter   slurp_for  slurp_join slurp_local    slurp_FS
slurp_for     4.81          --         -8%        -90%        -92%
slurp_join    4.41          9%          --        -89%        -91%
slurp_local  0.485        890%        808%          --        -21%
slurp_FS     0.385       1150%       1045%         26%          --
99299497 slurp_local
99299497 slurp_join
99299497 slurp_for
99299497 slurp_FS

$ perl slurp_bench.pl TD2 300
Reading file 'TD2' 300 times with each routine.
              Rate   slurp_for  slurp_join slurp_local    slurp_FS
slurp_for   2.30/s          --        -11%        -91%        -91%
slurp_join  2.59/s         13%          --        -89%        -90%
slurp_local 24.5/s        968%        849%          --         -9%
slurp_FS    26.9/s       1071%        940%         10%          --
 9027227 slurp_local
 9027227 slurp_join
 9027227 slurp_for
 9027227 slurp_FS

$ perl slurp_bench.pl TESTDATA 500
Reading file 'TESTDATA' 500 times with each routine.
              Rate   slurp_for  slurp_join slurp_local    slurp_FS
slurp_for   25.8/s          --         -9%        -90%        -91%
slurp_join  28.3/s         10%          --        -89%        -90%
slurp_local  248/s        860%        775%          --        -12%
slurp_FS     281/s        990%        893%         13%          --
  820657 slurp_local
  820657 slurp_join
  820657 slurp_for
  820657 slurp_FS

Reading file 'TDSmall' 10000 times with each routine.
               Rate   slurp_for  slurp_join    slurp_FS slurp_local
slurp_for    3774/s          --        -12%        -67%        -69%
slurp_join   4292/s         14%          --        -62%        -64%
slurp_FS    11364/s        201%        165%          --         -6%
slurp_local 12048/s        219%        181%          6%          --
    3391 slurp_local
    3391 slurp_join
    3391 slurp_for
    3391 slurp_FS

Reading file 'TDTiny' 25000 times with each routine.
               Rate    slurp_FS   slurp_for  slurp_join slurp_local
slurp_FS    10504/s          --         -3%         -8%        -24%
slurp_for   10776/s          3%          --         -6%        -22%
slurp_join  11416/s          9%          6%          --        -17%
slurp_local 13736/s         31%         27%         20%          --
     198 slurp_local
     198 slurp_join
     198 slurp_for
     198 slurp_FS
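In case the percentage columns are confusing: as far as I can tell from the numbers, each cell is how much faster the row is than the column, i.e. 100 * (row rate / column rate - 1). So the 849% for slurp_local against slurp_join in the TD2 run works out to roughly 9.5 times the rate. Here is a small sketch that recomputes this from the rates as printed. The helper name pct_faster is mine, and because the printed rates are already rounded, the result can differ from the table by a few percentage points.

```perl
use strict;
use warnings;

# Rates in iterations per second, copied from the TD2 run above.
my %rate = (
    slurp_for   => 2.30,
    slurp_join  => 2.59,
    slurp_local => 24.5,
    slurp_FS    => 26.9,
);

# Hypothetical helper: percentage by which $row is faster than $col.
# A negative result means $row is slower than $col.
sub pct_faster {
    my ($row, $col) = @_;
    return 100 * ($rate{$row} / $rate{$col} - 1);
}

for my $fast (qw(slurp_local slurp_FS)) {
    printf "%-11s runs at %.1fx the rate of slurp_join (%+.0f%%)\n",
        $fast,
        $rate{$fast} / $rate{slurp_join},
        pct_faster($fast, 'slurp_join');
}
```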

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Re^3: Matching data in a big file
by hdb (Monsignor) on Dec 14, 2013 at 18:25 UTC

    Thanks roboticus for your useful comparison.