The "sys" method is the only one that actually has to check the size of the file. The other methods don't have this overhead; they just read until they hit EOF. Perhaps checking the file size is relatively slow on Windows?

Another problem with the "sys" method is that it only works on plain files; you can't use it to slurp from a device or pipe, because it won't be able to get an accurate file size.

Here's an interesting variation. It uses sysread, but avoids having to fetch the file's size by doing several fixed-size (but large) sysreads in a loop:

use Fcntl;
use Benchmark qw(cmpthese);
cmpthese(1000,
     {
      slurp=>sub{
        open(SITE, "/usr/share/dict/words");
        my $xml;
        local $/;    # localizing $/ already leaves it undef (slurp mode)
        $xml = <SITE>;
        close(SITE);
      }
      ,
      sys=>sub{
        sysopen(SITE, "/usr/share/dict/words", O_RDONLY);
        sysread SITE, my $xml, -s SITE;
        close(SITE);
        }
      ,
      sysby128=>sub{
        sysopen(SITE, "/usr/share/dict/words", O_RDONLY);
        my $xml = '';
        while (sysread(SITE, $xml, 1024 * 128, length($xml))) { };
        close(SITE);
        }
      ,
      sysby256=>sub{
        sysopen(SITE, "/usr/share/dict/words", O_RDONLY);
        my $xml = '';
        while (sysread(SITE, $xml, 1024 * 256, length($xml))) { };
        close(SITE);
        }
      ,
      sysby512=>sub{
        sysopen(SITE, "/usr/share/dict/words", O_RDONLY);
        my $xml = '';
        while (sysread(SITE, $xml, 1024 * 512, length($xml))) { };
        close(SITE);
        }
      }
);

I've set up three versions, reading different amounts of data per sysread. My "words" file is around 409k, so the sysby512 trial will actually read the whole file at once (though it will call sysread a second time to discover it's at EOF). Here's the benchmark on an unloaded system:

> uname -a
Linux linux.local 2.4.16 #4 Mon Dec 10 08:26:03 PST 2001 i586 unknown
> perl index.pl
Benchmark: timing 1000 iterations of slurp, sys, sysby128, sysby256, sysby512...
     slurp:  9 wallclock secs ( 3.56 usr +  4.57 sys =  8.13 CPU) @ 123.00/s (n=1000)
       sys:  7 wallclock secs ( 0.17 usr +  5.66 sys =  5.83 CPU) @ 171.53/s (n=1000)
  sysby128:  7 wallclock secs ( 0.27 usr +  5.87 sys =  6.14 CPU) @ 162.87/s (n=1000)
  sysby256:  7 wallclock secs ( 0.18 usr +  5.69 sys =  5.87 CPU) @ 170.36/s (n=1000)
  sysby512:  7 wallclock secs ( 0.16 usr +  5.51 sys =  5.67 CPU) @ 176.37/s (n=1000)
          Rate    slurp sysby128 sysby256      sys sysby512
slurp    123/s       --     -24%     -28%     -28%     -30%
sysby128 163/s      32%       --      -4%      -5%      -8%
sysby256 170/s      39%       5%       --      -1%      -3%
sys      172/s      39%       5%       1%       --      -3%
sysby512 176/s      43%       8%       4%       3%       --

And here's another run, this time with XMMS (a GUI-based MP3 player) playing in the background to load the system a bit:

> perl index.pl
Benchmark: timing 1000 iterations of slurp, sys, sysby128, sysby256, sysby512...
     slurp: 12 wallclock secs ( 4.29 usr +  5.43 sys =  9.72 CPU) @ 102.88/s (n=1000)
       sys:  8 wallclock secs ( 0.10 usr +  6.88 sys =  6.98 CPU) @ 143.27/s (n=1000)
  sysby128:  9 wallclock secs ( 0.21 usr +  6.98 sys =  7.19 CPU) @ 139.08/s (n=1000)
  sysby256:  8 wallclock secs ( 0.25 usr +  6.74 sys =  6.99 CPU) @ 143.06/s (n=1000)
  sysby512:  9 wallclock secs ( 0.15 usr +  6.70 sys =  6.85 CPU) @ 145.99/s (n=1000)
          Rate    slurp sysby128 sysby256      sys sysby512
slurp    103/s       --     -26%     -28%     -28%     -30%
sysby128 139/s      35%       --      -3%      -3%      -5%
sysby256 143/s      39%       3%       --      -0%      -2%
sys      143/s      39%       3%       0%       --      -2%
sysby512 146/s      42%       5%       2%       2%       --

As you can see, all of the looping sysread methods perform quite respectably compared to the single-sysread method: sysby128 trails "sys" by only 3-5%, and sysby256 is essentially tied with it. The sysby512 method actually does slightly better (2-3% in both runs), probably because it avoids having to fetch the file size. If getting the file size is slow on Windows, the performance improvement should be even greater.
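
To see how many sysread calls each chunk size actually costs, here is a rough sketch that counts them. It is not part of the benchmark above: count_sysreads is a made-up helper name, and a temporary file of 409 * 1024 bytes stands in for my "words" file, so treat it as an illustration under those assumptions.

```perl
use strict;
use warnings;
use Fcntl;                      # exports O_RDONLY
use File::Temp qw(tempfile);

# count_sysreads is a hypothetical helper, not part of the original benchmark.
# It slurps $path in $chunk-byte sysreads and returns (call count, bytes read).
sub count_sysreads {
    my ($path, $chunk) = @_;
    sysopen(my $fh, $path, O_RDONLY) or die "sysopen $path: $!";
    my ($buf, $calls) = ('', 0);
    while (1) {
        my $n = sysread($fh, $buf, $chunk, length $buf);
        $calls++;
        die "sysread $path: $!" unless defined $n;
        last if $n == 0;        # the final call only discovers EOF
    }
    close($fh);
    return ($calls, length $buf);
}

# Stand-in for the 409k words file: a temporary file of 409 * 1024 bytes.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} 'x' x (409 * 1024);
close($tmp) or die "close: $!";

my %calls_for;
for my $kb (128, 256, 512) {
    my ($calls, $bytes) = count_sysreads($path, $kb * 1024);
    $calls_for{$kb} = $calls;
    print "chunk ${kb}k: $calls sysread calls, $bytes bytes\n";
}
```

On a local plain file I'd expect each sysread to be filled completely, which would give 5, 3 and 2 calls respectively, but short reads are allowed, so the counts are a lower bound rather than a guarantee.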

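Since the looping variants never ask for the file size, the same loop should also work where "sys" can't, such as on a pipe. Here's a minimal sketch of that, assuming a helper I'm calling slurp_handle (a hypothetical name) and a pipe created in the same process just for the demonstration:

```perl
use strict;
use warnings;

# slurp_handle is a hypothetical helper name, not from the benchmark above.
# It reads from an already-open handle in fixed-size sysreads until EOF,
# so it never needs to know the size of what it is reading.
sub slurp_handle {
    my ($fh, $chunk) = @_;
    $chunk ||= 128 * 1024;      # default chunk size, chosen arbitrarily
    my $buf = '';
    while (1) {
        my $n = sysread($fh, $buf, $chunk, length $buf);
        die "sysread failed: $!" unless defined $n;
        last if $n == 0;        # EOF: the writer has closed its end
    }
    return $buf;
}

# A pipe has no file size for -s to report, so the single-sysread
# approach has nothing to go on here.
pipe(my $reader, my $writer) or die "pipe failed: $!";
defined syswrite($writer, "hello from a pipe\n") or die "syswrite failed: $!";
close($writer);                 # lets the reader see EOF

my $data = slurp_handle($reader);
close($reader);
print $data;
```

The write is kept small so it fits in the pipe's buffer; with more data you'd want the writer in a separate process.
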
In reply to Re: Re: Re: Reading entire file into scalar: speed differences? by kjherron
in thread Reading entire file into scalar: speed differences? by theguvnor
