kwaping has asked for the wisdom of the Perl Monks concerning the following question:

This node was inspired by Re^2: file reading issues.

I recently ran a simple test of the speed of these two blocks of code:
open(FILE,"<$file") || die $!; read FILE, my $data, -s $file; close(FILE);
open(FILE,"<$file") || die $!; my $data = do { local $/; <FILE> }; close(FILE);
I expected the second method (a traditional slurp) to be faster. However, I was surprised to find that the read function was almost twice as fast!

Can anyone explain why?

Here's the simple test I ran:
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);

$| = 1;

my $file = '/path/to/file.pdf';
my $numtests = 1000;

print "using read function\n";
for (my $x = 0; $x < 6; $x++) {
    my $start = time();
    for (my $i = 0; $i <= $numtests; $i++) {
        open(FILE,"<$file") || die $!;
        read FILE, my $data, -s $file;
        close(FILE);
    }
    my $end = time();
    print $end - $start,"\n";
}

print "\ntraditional slurp\n";
for (my $x = 0; $x < 6; $x++) {
    my $start = time();
    for (my $i = 0; $i <= $numtests; $i++) {
        open(FILE,"<$file") || die $!;
        my $data = do { local $/; <FILE> };
        close(FILE);
    }
    my $end = time();
    print $end - $start,"\n";
}
# exit; #<- removed for blazar ;)

Here's the output:
using read function
0.612063884735107
0.62070107460022
0.599463939666748
0.60235595703125
0.610018014907837
0.603386878967285

traditional slurp
1.03461217880249
1.0298318862915
1.05549097061157
1.08192896842957
1.02464509010315
1.0180230140686

Re: Speed reading (files)
by BrowserUk (Patriarch) on Aug 04, 2005 at 16:06 UTC

    With read, you are telling it how big the file is, so it can preallocate a buffer to that size and fill it in a single call to the system (even if the system chooses to break it into smaller reads from disk). So, one call to the system to allocate the memory. One call to perform the read.

    With the traditional slurp, it doesn't know how big the file is, so it

    1. allocates a buffer (probably 4k),
    2. calls the system to read 4k,
    3. checks to see if 4k was read,
      • if not, finished.
      • if so, reallocates the buffer to +4k (which probably means the first 4k gets copied).
    4. repeat from step 2.

    Try using:

    my $data = do { local $/ = \( -s FILE ); <FILE> };
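
    For context, a rough and untested sketch of how that drops into the benchmark loop from the root node (same $file and FILE handle as in the original test script). Setting $/ to a reference to a number makes <FILE> read fixed-size records, so a single "record" here is the whole file:

    # untested sketch; FILE/$file as in the original test script
    open(FILE,"<$file") || die $!;
    my $data = do { local $/ = \( -s FILE ); <FILE> };  # one record == entire file
    close(FILE);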

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
Re: Speed reading (files)
by blazar (Canon) on Aug 04, 2005 at 15:38 UTC
    Not that this has ever mattered much in most practical situations I've faced myself (so I most commonly stick with local $/;), but I have often seen it mentioned that File::Slurp provides much-optimized file slurping facilities.
      I too have read about File::Slurp and how it's supposed to be super fast. However, check this out:
      using read function
      0.684900999069214
      0.681570053100586
      0.680304050445557
      0.675194025039673
      0.684563159942627
      0.686581134796143

      File::Slurp
      1.57559299468994
      1.5706889629364
      1.5739688873291
      1.5618691444397
      1.56290698051453
      1.58691692352295
      That's using the same test as above, replacing
      open(FILE,"<$file") || die $!; my $data = do { local $/; <FILE> }; close(FILE);
      with
      my $data = read_file($file);
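
      In other words, the timed inner loop becomes (a rough, untested sketch, relying on File::Slurp exporting read_file by default):

      use File::Slurp;  # exports read_file() by default

      for (my $i = 0; $i <= $numtests; $i++) {
          my $data = read_file($file);  # open, slurp and close all happen inside read_file
      }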

        Have you seen the article in the File::Slurp distribution? At its core, it's doing something very similar to the "read" approach. I'd guess the speed difference you're seeing is due to the option/error/context checking that it's doing, which your hand-coded example is not.

        -xdg

        Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

Re: Speed reading (files)
by izut (Chaplain) on Aug 04, 2005 at 16:25 UTC
    I did this test:
    use Benchmark qw(:all);
    use File::Slurp;

    $file_slurp = sub {
        $file = read_file('arq.pdf', binmode => ':raw');
    };
    $read = sub {
        open $file, '<', 'arq.pdf' or die $!;
        read $file, $data, -s 'arq.pdf';
        close $file;
    };

    timethese(1000, { file_slurp => $file_slurp, read => $read });
    cmpthese(1000, { file_slurp => $file_slurp, read => $read });
    I got this results:
    Benchmark: timing 1000 iterations of file_slurp, read...
    file_slurp:  1 wallclock secs ( 0.43 usr +  0.31 sys =  0.74 CPU) @ 1351.35/s (n=1000)
          read:  6 wallclock secs ( 4.88 usr +  0.44 sys =  5.32 CPU) @ 187.97/s (n=1000)
                  Rate       read file_slurp
    read         192/s         --       -86%
    file_slurp  1333/s       595%         --
    I see that file_slurp is the best option in this case. I used an 80k file with 1000 lines and 79 columns.

    Using a 2.5MB PDF file, the results were different:

    file_slurp: 44 wallclock secs (15.99 usr + 26.40 sys = 42.39 CPU) @ 23.59/s (n=1000)
          read: 23 wallclock secs ( 9.12 usr + 13.67 sys = 22.79 CPU) @ 43.88/s (n=1000)
                 Rate file_slurp  read
    file_slurp 23.6/s         --  -46%
    read       43.6/s        85%    --
    In this case File::Slurp was slower than read(). I see that File::Slurp has a binmode option. After changing to binmode:
    file_slurp: 46 wallclock secs (16.01 usr + 26.25 sys = 42.26 CPU) @ 23.66/s (n=1000)
          read: 25 wallclock secs ( 9.00 usr + 13.91 sys = 22.91 CPU) @ 43.65/s (n=1000)
                 Rate file_slurp  read
    file_slurp 23.7/s         --  -46%
    read       43.5/s        84%    --
    I think with bigger files read() has better performance than File::Slurp... I don't know why, but I hope it helps :-)


    Igor S. Lopes - izut
    surrender to perl. your code, your rules.
Re: Speed reading (files)
by anonymized user 468275 (Curate) on Aug 04, 2005 at 15:52 UTC
    The slurp is not as ravenous as its nickname suggests. Although it appears, even under the debugger, to slurp in one go, behind the scenes the <> operator is performing I/O system calls that are consistent with reading by delimiter, and that in spite of your local $/. read's buffers, on the other hand, are of an optimised size. As a rule, the larger the buffer of each actual I/O system call, the fewer such calls have to be performed and the more efficiently the whole file is read in.
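
    As a rough, untested illustration of that last point (the 1MB chunk size is an arbitrary example value, not anything the original code uses), slurping via a few large sysread calls rather than many delimiter-driven reads looks like:

    open my $fh, '<', $file or die $!;
    my $data = '';
    while (sysread($fh, my $chunk, 1024 * 1024)) {  # each iteration is one large read system call
        $data .= $chunk;
    }
    close $fh;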

    One world, one people

Re: Speed reading (files)
by polettix (Vicar) on Aug 04, 2005 at 17:10 UTC
    The repetition of the test makes it highly probable that the file is cached in RAM. On one side, this means that the results you get are probably indicative of the real performance of the software portion of the reading process, i.e. the comparison is meaningful to some extent. OTOH, real-world situations in which you have to read gigabytes of data will probably suffer a bottleneck from the actual reading from the device, so the differences would probably disappear.

    The interesting thing is in your answer to the node you pointed to. You do this because that's how you learned to do it - because that's how you see it done almost everywhere (apart from the use of File::Slurp). This is no cargo cult IMHO, but a useful Perl-level idiom that says "slurp that file"; to this extent, using read would be way too "low-level"!

    Flavio
    perl -ple'$_=reverse' <<<ti.xittelop@oivalf

    Don't fool yourself.
Re: Speed reading (files)
by ikegami (Patriarch) on Aug 04, 2005 at 17:39 UTC
    IIRC,
    open(FILE,"<$file") || die $!; my $data = do { local $/; <FILE> }; close(FILE);
    is slower than
    open(FILE,"<$file") || die $!; my $data; { local $/; $data = <FILE> } close(FILE);
    which is the true traditional slurp.
      Read is still faster, but the difference is very minute. Using read is still no worse than the true traditional slurp, which I find interesting.
      using read function
      0.614683866500854
      0.604650020599365
      0.616497039794922
      0.640319108963013
      0.603079795837402
      0.605829000473022

      ikegami's code
      0.672801971435547
      0.667882919311523
      0.666944980621338
      0.670393943786621
      0.712579011917114
      0.767023086547852

        The advantage of "slurping" is that it works with ttys. I don't think -s does.

        Also, is read guaranteed to return the number of bytes requested (if the file is big enough)? sysread isn't.
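
        (With sysread the usual defensive pattern is to loop until the requested count has accumulated; a rough, untested sketch, assuming $fh is already open on $file:)

        my $want = -s $file;
        my $data = '';
        while (length($data) < $want) {
            my $got = sysread($fh, $data, $want - length($data), length($data));
            die "sysread: $!" unless defined $got;  # undef signals a read error
            last unless $got;                       # 0 means EOF before $want bytes arrived
        }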

        As an aside, wouldn't read FILE, my $data, $MAX_FILE_SIZE; be faster than using -s, especially on smaller files? I guess you have to find an acceptable $MAX_FILE_SIZE (in Config.pm, maybe???)

        By the way, I'm not surprised that read is as fast or faster than "slurping". Why wouldn't it be?