theguvnor has asked for the wisdom of the Perl Monks concerning the following question:

Quick question: is there a big speed difference between these two methods of reading an entire text/xml file?

1. while loop reading file line by line:

my $xml;
while (<SITE>) { $xml .= $_; }

2. slurping all at once:

my $xml;
{ local($/); undef $/; $xml = <SITE>; }
I'd expect method 2 to be faster but I don't know if there are any other issues to consider. Thanks to anyone who has experience with this and cares to comment.
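
(For reference, here is a self-contained sketch of both approaches using a lexical filehandle and three-arg open; "site.xml" is just a placeholder filename, not anything from the original post:)

use strict;
use warnings;

my $file = 'site.xml';   # placeholder filename

# 1. line by line
open my $fh, '<', $file or die "Cannot open $file: $!";
my $xml = '';
$xml .= $_ while <$fh>;
close $fh;

# 2. slurp by localizing the input record separator
open my $fh2, '<', $file or die "Cannot open $file: $!";
my $xml2 = do { local $/; <$fh2> };
close $fh2;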

Replies are listed 'Best First'.
Re: Reading entire file into scalar: speed differences?
by blakem (Monsignor) on Jan 25, 2002 at 05:56 UTC
    Assuming it works on your OS and for your particular file, the following code will be faster than either of your options above:
    sysread SITE, my $xml, -s SITE;
    See the discussion (including benchmarks) at Slurp a file
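
    (Spelled out a little, here is a hedged sketch of that idiom with basic error checking added; "site.xml" is just a placeholder name, and -s is only meaningful for a plain file:)

    use strict;
    use warnings;

    open my $fh, '<', 'site.xml' or die "Cannot open site.xml: $!";
    my $size = -s $fh;                        # size of the (plain) file in bytes
    my $read = sysread($fh, my $xml, $size);  # read it all in one call
    defined $read or die "sysread failed: $!";
    warn "short read: got $read of $size bytes\n" if $read != $size;
    close $fh;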

    -Blake

      I'm credited for that idiom in "Perl for System Administration". :)

      _____________________________________________________
      Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: Reading entire file into scalar: speed differences?
by belg4mit (Prior) on Jan 25, 2002 at 06:14 UTC
    My idea of join with list-context read seems to do pretty well (if I read the results correctly).

    UPDATE: Err, nope, that's per second. Okay, I didn't think it would be fast; I just wanted to see.

    use Fcntl;
    use Benchmark qw(cmpthese);

    cmpthese(50, {
        while => sub {
            open(SITE, "/usr/share/dict/words");
            my $xml;
            while (<SITE>) { $xml .= $_; }
            close(SITE);
        },
        slurp => sub {
            open(SITE, "/usr/share/dict/words");
            my $xml;
            local($/);
            undef $/;
            $xml = <SITE>;
            close(SITE);
        },
        join => sub {
            open(SITE, "/usr/share/dict/words");
            my $xml = join('', <SITE>);
            close(SITE);
        },
        sys => sub {
            sysopen(SITE, "/usr/share/dict/words", O_RDONLY);
            sysread SITE, my $xml, -s SITE;
            close(SITE);
        },
    });

    __END__
    Benchmark: timing 50 iterations of join, slurp, sys, while...
          join:  8 wallclock secs ( 7.99 usr +  0.17 sys =  8.16 CPU) @   6.13/s (n=50)
         slurp:  0 wallclock secs ( 0.12 usr +  0.14 sys =  0.26 CPU) @ 192.31/s (n=50)
                (warning: too few iterations for a reliable count)
           sys:  1 wallclock secs ( 0.01 usr +  0.25 sys =  0.26 CPU) @ 192.31/s (n=50)
                (warning: too few iterations for a reliable count)
         while:  5 wallclock secs ( 5.09 usr +  0.12 sys =  5.21 CPU) @   9.60/s (n=50)
             Rate  join while   sys slurp
    join   6.13/s    --  -36%  -97%  -97%
    while  9.60/s   57%    --  -95%  -95%
    sys     192/s 3038% 1904%    --   -0%
    slurp   192/s 3038% 1904%    0%    --

    --
    perl -pe "s/\b;([st])/'\1/mg"

      Using a single sysread() helps you win a probabilistic scheduling game, where you are competing against other programs that are also trying to move a disk head to fetch data. Moving a disk head is expensive, and if you can get your entire read satisfied with minimal intervening moves, you win. If you block during a read() and another program (or set of programs) gets requests queued that cause the head to move, you might lose unless the OS is smart enough to be doing read-ahead caching.

      If you're the only person on the box, Benchmark tests are rather suspect, since there's not a lot of competition for the disk head. It would be much more interesting to see how the benchmarks differed if you were to run 20 of them simultaneously.
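
      (One rough way to do that, purely as a sketch: fork N children that each exec the same benchmark script, so all the copies compete for the disk at once. "timereads.pl" here stands in for whatever benchmark script you are running.)

      use strict;
      use warnings;

      my $copies = 20;
      my @pids;
      for (1 .. $copies) {
          my $pid = fork();
          die "fork failed: $!" unless defined $pid;
          if ($pid == 0) {                     # child: run one copy of the benchmark
              exec $^X, 'timereads.pl' or die "exec failed: $!";
          }
          push @pids, $pid;
      }
      waitpid($_, 0) for @pids;                # parent: wait for all copies to finish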

        # uname -a
        SunOS host 5.8 Generic_108528-08 sun4u sparc SUNW,Sun-Fire-280R
        # uptime   # shortly before running the test
        9:41pm up 103 day(s), 13:13, 131 users, load average: 0.05, 0.07, 0.11

        Benchmark: timing 50 iterations of join, while...
              join:  2 wallclock secs ( 2.09 usr +  0.03 sys =  2.11 CPU) @ 23.64/s (n=50)
             while:  2 wallclock secs ( 1.87 usr +  0.04 sys =  1.92 CPU) @ 26.10/s (n=50)
                 Rate  join while
        join   23.6/s    --   -9%
        while  26.1/s   10%    --

        # Required more iterations for accuracy on this machine
        Benchmark: timing 5000 iterations of slurp, sys...
             slurp:  4 wallclock secs ( 1.61 usr +  2.30 sys =  3.91 CPU) @ 1280.08/s (n=5000)
               sys:  2 wallclock secs ( 0.12 usr +  1.86 sys =  1.99 CPU) @ 2516.36/s (n=5000)
        slurp 1280/s    --  -49%
        sys   2516/s   97%    --

        --
        perl -pe "s/\b;([st])/'\1/mg"

      Since I'm assuming all the other posters were running under a unix of some sort, and thought it might well make a difference, here are some numbers for win32 (Win XP, perl 5.6.1, ActiveState build 630):

      D:\Documents and Settings\James\Desktop>perl timereads.pl
      Benchmark: running join, slurp, sys, while, each for at least 5 CPU seconds...
            join:  6 wallclock secs ( 3.92 usr +  1.33 sys =  5.25 CPU) @ 2823.36/s (n=14817)
           slurp:  5 wallclock secs ( 2.39 usr +  2.88 sys =  5.28 CPU) @ 5813.53/s (n=30678)
             sys:  5 wallclock secs ( 0.01 usr +  5.06 sys =  5.07 CPU) @ 11.25/s (n=57)
           while:  5 wallclock secs ( 4.03 usr +  1.28 sys =  5.31 CPU) @ 2834.21/s (n=15044)
                Rate    sys   join  while  slurp
      sys    11.2/s     --  -100%  -100%  -100%
      join   2823/s 24998%     --    -0%   -51%
      while  2834/s 25095%     0%     --   -51%
      slurp  5814/s 51579%   106%   105%     --

      (I used D:\WINXP\system32\oembios.bin, which is 12.5 MB with a significantly lower proportion of newlines, and "at least 5 CPU seconds" of time, so these numbers have a slightly different basis than belg4mit's.)

      Note how terribly sys does in my comparisons. These are, in essence, completely different results, which confuses me a lot.

      This is rapidly getting offtopic, but does anybody have a clue why?

      TACCTGTTTGAGTGTAACAATCATTCGCTCGGTGTATCCATCTTTG ACACAATGAATCTTTGACTCGAACAATCGTTCGGTCGCTCCGACGC
        The "sys" method is the only one that has to actually check size of the file. The other methods don't have this overhead; they just read until they get EOF. Perhaps checking the file size is relatively slow on Windows?

        Another problem with the "sys" method is that it only works on plain files; you can't use it to slurp from a device or pipe, because it won't be able to get an accurate file size.
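
        (A quick illustration of why, as a sketch: on most systems -s reports 0 for a pipe, so the single-sysread trick asks for 0 bytes and gets nothing back:)

        open my $pipe, '-|', 'cat /usr/share/dict/words' or die "pipe open failed: $!";
        my $size = -s $pipe || 0;                     # typically 0 for a pipe
        my $read = sysread($pipe, my $data, $size);   # asks for $size bytes
        print "asked for $size bytes, got $read\n";   # usually "asked for 0 bytes, got 0"
        close $pipe;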

        Here's an interesting variation. It uses sysread, but avoids having to fetch the file's size by doing several fixed-size (but large) sysreads in a loop:

        use Fcntl;
        use Benchmark qw(cmpthese);

        cmpthese(1000, {
            slurp => sub {
                open(SITE, "/usr/share/dict/words");
                my $xml;
                local($/);
                undef $/;
                $xml = <SITE>;
                close(SITE);
            },
            sys => sub {
                sysopen(SITE, "/usr/share/dict/words", O_RDONLY);
                sysread SITE, my $xml, -s SITE;
                close(SITE);
            },
            sysby128 => sub {
                sysopen(SITE, "/usr/share/dict/words", O_RDONLY);
                my $xml = '';
                while (sysread(SITE, $xml, 1024 * 128, length($xml))) { }
                close(SITE);
            },
            sysby256 => sub {
                sysopen(SITE, "/usr/share/dict/words", O_RDONLY);
                my $xml = '';
                while (sysread(SITE, $xml, 1024 * 256, length($xml))) { }
                close(SITE);
            },
            sysby512 => sub {
                sysopen(SITE, "/usr/share/dict/words", O_RDONLY);
                my $xml = '';
                while (sysread(SITE, $xml, 1024 * 512, length($xml))) { }
                close(SITE);
            },
        });
        I've set up three versions, reading different amounts of data per sysread. My "words" file is around 409k, so the sysby512 trial will actually read the whole file at once (though it will call sysread a second time to discover it's at EOF). Here's the benchmark on an unloaded system:
        > uname -a
        Linux linux.local 2.4.16 #4 Mon Dec 10 08:26:03 PST 2001 i586 unknown
        > perl index.pl
        Benchmark: timing 1000 iterations of slurp, sys, sysby128, sysby256, sysby512...
             slurp:  9 wallclock secs ( 3.56 usr +  4.57 sys =  8.13 CPU) @ 123.00/s (n=1000)
               sys:  7 wallclock secs ( 0.17 usr +  5.66 sys =  5.83 CPU) @ 171.53/s (n=1000)
          sysby128:  7 wallclock secs ( 0.27 usr +  5.87 sys =  6.14 CPU) @ 162.87/s (n=1000)
          sysby256:  7 wallclock secs ( 0.18 usr +  5.69 sys =  5.87 CPU) @ 170.36/s (n=1000)
          sysby512:  7 wallclock secs ( 0.16 usr +  5.51 sys =  5.67 CPU) @ 176.37/s (n=1000)
                    Rate slurp sysby128 sysby256   sys sysby512
        slurp      123/s    --     -24%     -28%  -28%     -30%
        sysby128   163/s   32%       --      -4%   -5%      -8%
        sysby256   170/s   39%       5%       --   -1%      -3%
        sys        172/s   39%       5%       1%    --      -3%
        sysby512   176/s   43%       8%       4%    3%       --
        And here's another run, running XMMS (a GUI-based MP3 player) to load the system a bit:
        > perl index.pl
        Benchmark: timing 1000 iterations of slurp, sys, sysby128, sysby256, sysby512...
             slurp: 12 wallclock secs ( 4.29 usr +  5.43 sys =  9.72 CPU) @ 102.88/s (n=1000)
               sys:  8 wallclock secs ( 0.10 usr +  6.88 sys =  6.98 CPU) @ 143.27/s (n=1000)
          sysby128:  9 wallclock secs ( 0.21 usr +  6.98 sys =  7.19 CPU) @ 139.08/s (n=1000)
          sysby256:  8 wallclock secs ( 0.25 usr +  6.74 sys =  6.99 CPU) @ 143.06/s (n=1000)
          sysby512:  9 wallclock secs ( 0.15 usr +  6.70 sys =  6.85 CPU) @ 145.99/s (n=1000)
                    Rate slurp sysby128 sysby256   sys sysby512
        slurp      103/s    --     -26%     -28%  -28%     -30%
        sysby128   139/s   35%       --      -3%   -3%      -5%
        sysby256   143/s   39%       3%       --   -0%      -2%
        sys        143/s   39%       3%       0%    --      -2%
        sysby512   146/s   42%       5%       2%    2%       --
        As you can see, all of the looping sysread methods perform quite respectably compared to the single-sysread method. The sysby512 method actually does better, probably because it avoids having to fetch the file size. If getting the file size is slow on Windows, the performance improvement should be even greater.
      OK, so if I read the Benchmark summary correctly (I've really gotta learn how to use that thing so I too can amaze my friends ;-), the slurp method is quite a bit faster than the method I have been using up until now.

      Thanks all!

Re: Reading entire file into scalar: speed differences?
by Anonymous Monk on Jan 25, 2002 at 09:52 UTC
    Rumour has it that it's faster to read blocks instead of the entire file. This isn't confirmed, though, and there might be other issues, as described in this node.

    See it as an exercise (and use different block sizes) and please post your results here. :)
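
    (In case someone wants a starting point for that exercise, here is an untested sketch of a fixed-size block reader using read(); vary $blocksize and benchmark it against the slurp versions above. The dictionary file is just the same one the other posts used.)

    use strict;
    use warnings;

    my $blocksize = 64 * 1024;   # try different sizes here
    open my $fh, '<', '/usr/share/dict/words' or die "open failed: $!";
    my $xml = '';
    my $buf;
    while (my $got = read($fh, $buf, $blocksize)) {   # returns 0 at EOF
        $xml .= $buf;
    }
    close $fh;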