in reply to Perl's poor disk IO performance

I hope this reply isn't too long, but I'm a fan of disclosing source code to back up results.

The short of it is that I get about 50 MB/sec when processing files line by line in Perl.

Perl file performance is near and dear to my heart, since I routinely work on multi-gigabyte files. I wrote a benchmark program a little while ago to help me to stay with Perl, because performance was tempting me to go to C++ or to bypass Perl's buffering and do it myself (with large sysread calls).

I ran this on a file that was exactly 100 MB long, with lots of small lines, which is something of a worst case for a naive line-at-a-time approach. This is a UTF-8 file, and I was particularly interested in figuring out why my Unicode file reading was so pitifully slow on a Windows machine.

So my fix was to start specifying ":raw:perlio:utf8" on my file handles, and I got a 6x improvement in speed.
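For anyone who wants to try the same change, here is a minimal sketch of what specifying the layers at open time looks like. The sub name count_line_bytes and its $layers parameter are made up for illustration; only the layer strings themselves come from the benchmark.

```perl
use strict;
use warnings;

# Read a file line by line through an explicit layer stack and return the
# number of bytes seen. $layers is a hypothetical parameter holding a layer
# string such as ":raw:perlio" or ":raw:perlio:utf8".
sub count_line_bytes {
    my ($file, $layers) = @_;
    open(my $fh, "<$layers", $file) or die "Can't open $file: $!";
    my $bytes = 0;
    while (my $line = <$fh>) {
        use bytes;                  # make length() count bytes, not characters
        $bytes += length($line);
    }
    close($fh);
    return $bytes;
}
```

Calling count_line_bytes($file, ":raw:perlio:utf8") should return the on-disk size for a well-formed UTF-8 file, since the byte count is taken before any character semantics apply.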

    Line-at-a-time, default layers
      100.0 MB in 12.012 sec, 8.3 MB/sec

    Line-at-a-time, :raw:perlio
      100.0 MB in 1.837 sec, 54.4 MB/sec

    Line-at-a-time, :raw:perlio:utf8
      100.0 MB in 2.021 sec, 49.5 MB/sec

    Line-at-a-time, :win32:perlio
      100.0 MB in 1.805 sec, 55.4 MB/sec

    Slurp-into-scalar, default layers
      100.0 MB in 0.182 sec, 550.1 MB/sec

    Slurp-into-scalar, :raw:perlio
      100.0 MB in 0.065 sec, 1548.0 MB/sec

    Slurp-into-scalar, :raw:perlio:utf8
      100000000 on disk, 99999476 in memory
      100.0 MB in 0.129 sec, 778.1 MB/sec

    Slurp into scalar with sysopen/sysread (single read)
      100.0 MB in 0.034 sec, 2976.2 MB/sec

The code is below. Yes, it's pretty crude, but it was enough to tell me what I was doing wrong: PerlIO is the win.

The ridiculously large numbers are because the file gets into the Win32 file cache and stays there. That's actually a plus for my benchmark because it shows me where my bottlenecks are. The large sysread numbers are because no postprocessing is being done, e.g. breaking the file up into lines. Since 55 MB/sec is enough for me at the moment, I'm not looking at writing my own buffering/line processing code just yet.
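For reference, the "do it myself" alternative would look roughly like the sketch below: read big chunks with sysread and split them into lines by hand. This is not part of the benchmark and I have not timed it; the sub name for_each_line_sysread and the 1 MB default chunk size are arbitrary choices for illustration. Lines are handed to the callback as raw bytes, so any UTF-8 decoding would be an extra step on top.

```perl
use strict;
use warnings;

# Read $file in large sysread chunks and call $callback once per line.
# The 1 MB default chunk size is an untuned assumption.
sub for_each_line_sysread {
    my ($file, $callback, $chunk_size) = @_;
    $chunk_size ||= 1 << 20;

    open(my $fh, "<:raw", $file) or die "Can't open $file: $!";
    my $tail = "";    # partial last line carried over from the previous chunk
    while (1) {
        my $count = sysread($fh, my $chunk, $chunk_size);
        die "read error on $file: $!" unless defined $count;
        last if $count == 0;    # end of file

        $chunk = $tail . $chunk;
        my $pos = 0;
        # Hand out every complete line currently in the buffer.
        while ((my $nl = index($chunk, "\n", $pos)) >= 0) {
            $callback->(substr($chunk, $pos, $nl - $pos + 1));
            $pos = $nl + 1;
        }
        $tail = substr($chunk, $pos);    # keep whatever follows the last newline
    }
    $callback->($tail) if length $tail;  # final line without a trailing newline
    close($fh);
    return;
}
```

Whether this would actually beat the :raw:perlio line reader would have to be measured; the per-line substr and callback overhead may well eat the gain.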

But it also shows that perlio is imposing a tax compared to pure sysread. So maybe someday I'll look at the PerlIO code and see if there are some useful optimizations that won't pessimize something else.

#!/usr/bin/perl

use strict;
use warnings;
use utf8;
use Fcntl qw();
use Time::HiRes qw();

my $testfile = shift or die "Specify a test file";
die "$testfile doesn't exist" unless -f $testfile;

my @benchmarks = (
    \&bench1_native,
#   \&bench1_raw,
#   \&bench1_mmap,
    \&bench1_raw_perlio,
    \&bench1_raw_perlio_utf8,
    \&bench1_win32,
    \&bench2_native,
    \&bench2_raw_perlio,
    \&bench2_raw_perlio_utf8,
    \&bench3
);

foreach my $bench (@benchmarks) {
    my ($secs, $bytes, $lines) = $bench->($testfile);
    my $mb = $bytes / 1_000_000;
    print sprintf(" %.1f MB in %.3f sec, %.1f MB/sec\n", $mb, $secs, $mb / $secs);
    print sprintf(" %1.fK lines, %.2f KL/sec\n", $lines / 1_000, ($lines / 1_000) / $secs) if defined($lines);
}

# ------------------------------------------------------------------
# Read a line at a time with <fh>

sub bench1_native { return bench1_common(@_, "Line-at-a-time, default layers", "<"); }
sub bench1_raw { return bench1_common(@_, "Line-at-a-time, :raw", "<:raw"); }
sub bench1_mmap { return bench1_common(@_, "Line-at-a-time, :raw:mmap", "<:raw:mmap"); }
sub bench1_raw_perlio { return bench1_common(@_, "Line-at-a-time, :raw:perlio", "<:raw:perlio"); }
sub bench1_raw_perlio_utf8 { return bench1_common(@_, "Line-at-a-time, :raw:perlio:utf8", "<:raw:perlio:utf8"); }
sub bench1_win32 { return bench1_common(@_, "Line-at-a-time, :win32:perlio", "<:win32:perlio"); }

sub bench1_common {
    my ($file, $prompt, $discipline) = @_;
    print "\n$prompt\n";
    open(my $fh, $discipline, $file) or die;
    my $size = -s $fh;
#   my $lines = 0;
    my $bytes = 0;
    my $start_time = Time::HiRes::time();
    while (<$fh>) {
        use bytes;
#       $lines += 1;
        $bytes += length($_);
    }
    my $end_time = Time::HiRes::time();
    close($fh);
    print " $size on disk, $bytes in memory\n" if $bytes != $size;
    my $secs = $end_time - $start_time;
#   return ($secs, $size, $lines);
    return ($secs, $size);
}

# ------------------------------------------------------------------

sub bench2_native { return bench2_common(@_, "Slurp-into-scalar, default layers", "<"); }
sub bench2_raw_perlio { return bench2_common(@_, "Slurp-into-scalar, :raw:perlio", "<:raw:perlio"); }
sub bench2_raw_perlio_utf8 { return bench2_common(@_, "Slurp-into-scalar, :raw:perlio:utf8", "<:raw:perlio:utf8"); }

# Read whole file with <fh>
sub bench2_common {
    my ($file, $prompt, $discipline) = @_;
    print "\n$prompt\n";
    open(my $fh, $discipline, $file) or die;
    my $size = -s $fh;
    local $/ = undef;
    my $buf = "";
    vec($buf, $size, 8) = 0;
    my $start_time = Time::HiRes::time();
    $buf = <$fh>;
    my $end_time = Time::HiRes::time();
    close($fh);
    my $bufsize = length($buf);
#   die "file is $size but got $bufsize" unless $bufsize == $size;
    print " $size on disk, $bufsize in memory\n" if $bufsize != $size;
    my $secs = $end_time - $start_time;
    return ($secs, $size);
}

# ------------------------------------------------------------------
# Read whole file with sysopen/sysread

sub bench3 {
    my ($file) = @_;
    print "\n";
    print "Slurp into scalar with sysopen/sysread (single read)\n";
    sysopen(my $fh, $file, Fcntl::O_RDONLY | Fcntl::O_BINARY) or die;
    my $size = -s $fh;
    local $/ = undef;
    my $buf = "";
    vec($buf, $size, 8) = 0;
    my $start_time = Time::HiRes::time();
    my $count = sysread($fh, $buf, $size);
    my $end_time = Time::HiRes::time();
    die "read error: $!" unless defined($count) && $count == $size;
    close($fh);
    my $bufsize = length($buf);
    die "file is $size but got $bufsize" unless $bufsize == $size;
    my $secs = $end_time - $start_time;
    return ($secs, $size);
}