I hope this reply isn't too long, but I'm a fan of disclosing source code to back up results.

The short of it is that I get about 50 MB/sec when processing files line by line in Perl.

Perl file performance is near and dear to my heart, since I routinely work on multi-gigabyte files. I wrote a benchmark program a little while ago to help me decide whether I could stay with Perl, because slow reads were tempting me to move to C++ or to bypass Perl's buffering and do it myself (with large sysread calls).

I ran this on a file that was exactly 100 MB long, with lots of small lines, so it is close to a worst case for a naive line-at-a-time approach. This is a UTF-8 file, and I was particularly interested in figuring out why my Unicode file reading was so pitifully slow on a Windows machine.

My fix was to start specifying ":raw:perlio:utf8" on my file handles, and I got a roughly 6x improvement in line-at-a-time speed (8.3 MB/sec with default layers versus 49.5 MB/sec).
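For anyone who wants to try the layer change in isolation, here is a minimal sketch. The helper name count_lines_with_layers is made up for illustration and is not part of the benchmark; the only thing taken from the post is the layer string passed as part of the open mode.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the benchmark): count the lines in a file,
# opening it with an explicit PerlIO layer stack such as ":raw:perlio:utf8"
# instead of the platform's default layers.
sub count_lines_with_layers {
    my ($path, $layers) = @_;

    # The layers are appended directly to the "<" read mode.
    open(my $fh, "<$layers", $path) or die "Cannot open $path: $!";

    my $lines = 0;
    $lines++ while <$fh>;

    close($fh);
    return $lines;
}

# Usage: perl count.pl somefile.txt
print count_lines_with_layers($ARGV[0], ":raw:perlio:utf8"), "\n" if @ARGV;
```

Passing an empty string as the layer argument gives the default layers, which makes it easy to compare the two on the same file.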

    Line-at-a-time, default layers
     100.0 MB in 12.012 sec, 8.3 MB/sec

    Line-at-a-time, :raw:perlio
     100.0 MB in 1.837 sec, 54.4 MB/sec

    Line-at-a-time, :raw:perlio:utf8
     100.0 MB in 2.021 sec, 49.5 MB/sec

    Line-at-a-time, :win32:perlio
     100.0 MB in 1.805 sec, 55.4 MB/sec

    Slurp-into-scalar, default layers
     100.0 MB in 0.182 sec, 550.1 MB/sec

    Slurp-into-scalar, :raw:perlio
     100.0 MB in 0.065 sec, 1548.0 MB/sec

    Slurp-into-scalar, :raw:perlio:utf8
     100000000 on disk, 99999476 in memory
     100.0 MB in 0.129 sec, 778.1 MB/sec

    Slurp into scalar with sysopen/sysread (single read)
     100.0 MB in 0.034 sec, 2976.2 MB/sec

Here's the code. Yes, pretty crude, but it was enough to tell me what I was doing wrong - PerlIO is the win.

The ridiculously large numbers are because the file gets into the Win32 file cache and stays there. That's actually a plus for my benchmark because it shows me where my bottlenecks are. The large sysread numbers are because no postprocessing is being done, e.g. breaking the file up into lines. Since 55 MB/sec is enough for me at the moment, I'm not looking at writing my own buffering/line processing code just yet.
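To make "writing my own buffering/line processing code" concrete, here is a rough sketch of what that could look like: read large chunks with sysread, split them on newlines, and carry any partial last line over to the next chunk. This is not part of the benchmark and has not been timed; the name each_line_sysread, the callback interface, and the 1 MB default chunk size are all assumptions for illustration. It hands back raw bytes (no UTF-8 decoding) and uses binmode rather than O_BINARY so that it also runs on non-Windows systems.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY);

# Hypothetical helper (not from the benchmark): call $callback once per line,
# reading the file in chunks of $chunk_size bytes with sysread.
sub each_line_sysread {
    my ($path, $callback, $chunk_size) = @_;
    $chunk_size ||= 1 << 20;    # assumed default: 1 MB per read

    sysopen(my $fh, $path, O_RDONLY) or die "Cannot open $path: $!";
    binmode($fh);

    my $tail = '';              # partial line left over from the previous chunk
    my $buf  = '';
    while (1) {
        my $got = sysread($fh, $buf, $chunk_size);
        die "read error: $!" unless defined $got;
        last if $got == 0;      # end of file

        $buf = $tail . $buf;
        my $last_nl = rindex($buf, "\n");
        if ($last_nl < 0) {     # no complete line yet, keep accumulating
            $tail = $buf;
            next;
        }
        $tail = substr($buf, $last_nl + 1);

        # Hand every complete line (including its newline) to the callback.
        my $pos = 0;
        while ((my $nl = index($buf, "\n", $pos)) >= 0) {
            $callback->(substr($buf, $pos, $nl - $pos + 1));
            $pos = $nl + 1;
        }
    }
    $callback->($tail) if length $tail;    # final line without a newline

    close($fh);
    return;
}
```

A caller would pass a code reference, for example each_line_sysread($file, sub { $bytes += length($_[0]) }), which mirrors what the line-at-a-time benchmark measures.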

But it also shows that PerlIO imposes a tax compared to pure sysread. So maybe someday I'll look at the PerlIO code and see if there are some useful optimizations that won't pessimize something else.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;
    use Fcntl qw();
    use Time::HiRes qw();

    my $testfile = shift or die "Specify a test file";
    die "$testfile doesn't exist" unless -f $testfile;

    my @benchmarks = (
        \&bench1_native,
        # \&bench1_raw,
        # \&bench1_mmap,
        \&bench1_raw_perlio,
        \&bench1_raw_perlio_utf8,
        \&bench1_win32,
        \&bench2_native,
        \&bench2_raw_perlio,
        \&bench2_raw_perlio_utf8,
        \&bench3
    );

    foreach my $bench (@benchmarks) {
        my ($secs, $bytes, $lines) = $bench->($testfile);
        my $mb = $bytes / 1_000_000;
        print sprintf(" %.1f MB in %.3f sec, %.1f MB/sec\n", $mb, $secs, $mb / $secs);
        print sprintf(" %1.fK lines, %.2f KL/sec\n", $lines / 1_000, ($lines / 1_000) / $secs) if defined($lines);
    }

    # ------------------------------------------------------------------
    # Read a line at a time with <fh>
    sub bench1_native { return bench1_common(@_, "Line-at-a-time, default layers", "<"); }
    sub bench1_raw { return bench1_common(@_, "Line-at-a-time, :raw", "<:raw"); }
    sub bench1_mmap { return bench1_common(@_, "Line-at-a-time, :raw:mmap", "<:raw:mmap"); }
    sub bench1_raw_perlio { return bench1_common(@_, "Line-at-a-time, :raw:perlio", "<:raw:perlio"); }
    sub bench1_raw_perlio_utf8 { return bench1_common(@_, "Line-at-a-time, :raw:perlio:utf8", "<:raw:perlio:utf8"); }
    sub bench1_win32 { return bench1_common(@_, "Line-at-a-time, :win32:perlio", "<:win32:perlio"); }

    sub bench1_common {
        my ($file, $prompt, $discipline) = @_;
        print "\n$prompt\n";
        open(my $fh, $discipline, $file) or die;
        my $size = -s $fh;
        # my $lines = 0;
        my $bytes = 0;
        my $start_time = Time::HiRes::time();
        while (<$fh>) {
            use bytes;
            # $lines += 1;
            $bytes += length($_);
        }
        my $end_time = Time::HiRes::time();
        close($fh);
        print " $size on disk, $bytes in memory\n" if $bytes != $size;
        my $secs = $end_time - $start_time;
        # return ($secs, $size, $lines);
        return ($secs, $size);
    }

    # ------------------------------------------------------------------
    sub bench2_native { return bench2_common(@_, "Slurp-into-scalar, default layers", "<"); }
    sub bench2_raw_perlio { return bench2_common(@_, "Slurp-into-scalar, :raw:perlio", "<:raw:perlio"); }
    sub bench2_raw_perlio_utf8 { return bench2_common(@_, "Slurp-into-scalar, :raw:perlio:utf8", "<:raw:perlio:utf8"); }

    # Read whole file with <fh>
    sub bench2_common {
        my ($file, $prompt, $discipline) = @_;
        print "\n$prompt\n";
        open(my $fh, $discipline, $file) or die;
        my $size = -s $fh;
        local $/ = undef;
        my $buf = "";
        vec($buf, $size, 8) = 0;
        my $start_time = Time::HiRes::time();
        $buf = <$fh>;
        my $end_time = Time::HiRes::time();
        close($fh);
        my $bufsize = length($buf);
        # die "file is $size but got $bufsize" unless $bufsize == $size;
        print " $size on disk, $bufsize in memory\n" if $bufsize != $size;
        my $secs = $end_time - $start_time;
        return ($secs, $size);
    }

    # ------------------------------------------------------------------
    # Read whole file with sysopen/sysread
    sub bench3 {
        my ($file) = @_;
        print "\n";
        print "Slurp into scalar with sysopen/sysread (single read)\n";
        sysopen(my $fh, $file, Fcntl::O_RDONLY | Fcntl::O_BINARY) or die;
        my $size = -s $fh;
        local $/ = undef;
        my $buf = "";
        vec($buf, $size, 8) = 0;
        my $start_time = Time::HiRes::time();
        my $count = sysread($fh, $buf, $size);
        my $end_time = Time::HiRes::time();
        die "read error: $!" unless defined($count) && $count == $size;
        close($fh);
        my $bufsize = length($buf);
        die "file is $size but got $bufsize" unless $bufsize == $size;
        my $secs = $end_time - $start_time;
        return ($secs, $size);
    }

In reply to Re: Perl's poor disk IO performance by Anonymous Monk
in thread Perl's poor disk IO performance by TROGDOR
