TROGDOR has asked for the wisdom of the Perl Monks concerning the following question:
    open (FILE, "$path") or die "ERROR: Could not open $path.\n";
    while (1) {
        # Each record starts with a 4-byte header: a 16-bit big-endian
        # record size followed by two one-byte type fields.
        $eof = read (FILE, $header, 4);
        ($size, $code, $ftype) = unpack ("nCC", $header);
        if ($size == 0) {
            print "Size is zero. Exiting.\n";
            last;
        }
        $size = $size - 4;          # payload length excludes the header
        if ($size > 0) {
            $eof = read (FILE, $data, $size);
        }
    }
    close FILE;
Replies are listed 'Best First'.
Re: Perl's poor disk IO performance
by moritz (Cardinal) on Apr 29, 2010 at 19:29 UTC
That said, there is a way to improve IO speed. Perl's normal open, read and readline functions use IO layers, which you can circumvent by using sysopen and sysread.
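For instance, here is a minimal sketch of the OP's read loop rewritten with sysopen and sysread (the error handling and short-read check are additions; since sysread does no buffering or layer processing for you, you must handle short reads yourself):

    use Fcntl qw(O_RDONLY);

    sysopen(my $fh, $path, O_RDONLY)
        or die "ERROR: Could not open $path: $!\n";
    while (1) {
        my $got = sysread($fh, my $header, 4);
        die "read error: $!\n" unless defined $got;
        last if $got < 4;                       # EOF or truncated header
        my ($size, $code, $ftype) = unpack('nCC', $header);
        last if $size == 0;
        sysread($fh, my $data, $size - 4) if $size > 4;
    }
    close $fh;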
Perl 6 - links to (nearly) everything that is Perl 6.
Re: Perl's poor disk IO performance
by BrowserUk (Patriarch) on Apr 29, 2010 at 19:57 UTC
Change open (FILE, "$path") to open (FILE, '<:raw', "$path"). On my system, with that change, your code reading 10 MB takes 0.38 seconds.
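In lexical-filehandle form, the whole fix is the layer in a three-argument open (same $path as in the OP's code):

    # :raw strips the :crlf translation layer (the default on Windows),
    # so byte-wise read()s stop paying for newline translation.
    open(my $fh, '<:raw', $path) or die "ERROR: Could not open $path: $!\n";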
Re: Perl's poor disk IO performance
by roboticus (Chancellor) on Apr 29, 2010 at 21:33 UTC
A few minor notes: Just as a reference, I do quite a lot of file processing (in my day job) in both C/C++ and Perl. I often find that the Perl versions run slower, but not enough that I want to write all my processing programs in C/C++. Regexes and complex data munging are so much simpler in Perl that when I have to whack a file using a good bit of intelligence, I tend to reach for Perl first. If I need to whack a large file with just a little simple code and speed is of the essence, I tend to use C/C++. Only rarely do I find that I have to optimize a Perl program or rewrite it in C/C++. As usual, YMMV.

...roboticus
Re: Perl's poor disk IO performance
by Marshall (Canon) on Apr 29, 2010 at 23:43 UTC
I'd be curious whether open($fh, '<:unix', $path) produces further speed improvements beyond :raw. You didn't post the C code, so I'm not 100% sure that we have an apples-to-apples comparison here; there may be some detail that makes it not quite the same. BTW, are you on a Unix or a Windows platform? I don't think that matters, but it might in some weird way that I don't understand right now.

I've written binary-manipulation code in Perl before, for things like concatenating .wav files. I wouldn't normally think of Perl for a massive amount of binary number crunching, but it can do it! Most of my code works with ASCII, and huge amounts of time can get spent in the splitting and global regex matching; I have one app where 30% of the time is spent doing just that. The raw reading/writing to the disk is usually not an issue in my code, as there are other considerations that take far more time.

Update: see the fine benchmarks from BrowserUk. It appears that :perlio plus setting binmode($fh) is the way to go.
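For reference, the two variants in question look like this (a sketch; $path is the OP's file):

    # :unix is the lowest-level layer: unbuffered, each read() is a syscall.
    open(my $unix_fh, '<:unix', $path) or die "open: $!";

    # :perlio is the buffered layer; binmode() then makes it binary-safe.
    open(my $perlio_fh, '<:perlio', $path) or die "open: $!";
    binmode($perlio_fh);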
by BrowserUk (Patriarch) on Apr 30, 2010 at 00:20 UTC
That's interesting. As is often the case with Perl, things move (silently) on as new versions appear. I just re-ran a series of tests that I last performed shortly after IO layers were added. Back then, on my system ':raw' was exactly equivalent to using binmode. It no longer is, nor is either the fastest option. Using this:
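(BrowserUk's original test script isn't reproduced in this copy; the following is a minimal sketch of that kind of layer comparison, with a hypothetical 4 KB read size and layer list:)

    use strict;
    use warnings;
    use Time::HiRes qw(time);

    my $path = shift or die "usage: $0 <file>\n";

    # Time a full read of the file under each layer spec, with and
    # without an explicit binmode() on top.
    for my $layers ('<', '<:raw', '<:perlio', '<:unix', '<:crlf:raw') {
        for my $bin (0, 1) {
            my $start = time;
            open my $fh, $layers, $path or die "open $layers: $!";
            binmode $fh if $bin;
            1 while read $fh, my $buf, 4096;
            close $fh;
            printf "%-12s binmode=%d : %.3f s\n", $layers, $bin, time - $start;
        }
    }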
You can see (and interpret) the results for yourself:
On my system, I'll be using :perlio & binmode for fast binary access from now on. (Until it changes again. :) Perhaps even more indicative of the lag in the documentation is this:

If :raw popped all layers that were incompatible with binary reading, then :crlf:raw should be as fast as :crlf + binmode. But it ain't!

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
by Marshall (Canon) on Apr 30, 2010 at 19:27 UTC
Re: Perl's poor disk IO performance
by snoopy (Curate) on Apr 29, 2010 at 23:09 UTC
Memory mapping can be a better choice if I/O has been identified as a bottleneck and you want 'semi-random' access to your data, i.e. if you can be a bit selective, skipping records based on the headers and thus skipping significant blocks of data. Even if you are reading sequentially, it's worthwhile benchmarking this against your above solution anyway; it'll help determine whether read really is imposing a performance penalty! For example, the following uses Sys::Mmap:
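(The reply's original snippet isn't shown in this copy; here is a minimal sketch of the idea, walking the OP's record headers in a mapped file and skipping payloads by offset:)

    use strict;
    use warnings;
    use Sys::Mmap;

    open my $fh, '<', $path or die "open: $!";
    my $contents;
    # Map the whole file read-only; a length of 0 means "the entire file".
    mmap($contents, 0, PROT_READ, MAP_SHARED, $fh) or die "mmap: $!";

    my $pos = 0;
    while ($pos + 4 <= length $contents) {
        my ($size, $code, $ftype) = unpack 'nCC', substr($contents, $pos, 4);
        last if $size == 0;
        $pos += $size;    # jump straight past the payload without reading it
    }

    munmap($contents);
    close $fh;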
Re: Perl's poor disk IO performance
by Anonymous Monk on Dec 31, 2010 at 00:20 UTC
The short of it is that I get 50 MB/sec when processing files line-by-line in Perl. Perl file performance is near and dear to my heart, since I routinely work on multi-gigabyte files. I wrote a benchmark program a little while ago to help me stay with Perl, because performance was tempting me to go to C++, or to bypass Perl's buffering and do it myself with large sysread calls.

I ran it on a file that was exactly 100 MB long with lots of small lines, so somewhat of a worst case for a naive line-at-a-time approach. It's a UTF-8 file, and I was particularly interested in figuring out why my Unicode file reading was so pitifully slow on a Windows machine. My fix was to start specifying ":raw:perlio:utf8" on my file handles, which got me a 6x improvement in speed.
Here's the code. Yes, it's pretty crude, but it was enough to tell me what I was doing wrong: PerlIO is the win. The ridiculously large numbers are because the file gets into the Win32 file cache and stays there; that's actually a plus for my benchmark, because it shows me where my bottlenecks are. The large sysread numbers are because no postprocessing is being done, e.g. breaking the file up into lines. Since 55 MB/sec is enough for me at the moment, I'm not looking at writing my own buffering/line-processing code just yet. But it also shows that perlio imposes a tax compared to pure sysread, so maybe someday I'll look at the PerlIO code and see if there are some useful optimizations that won't pessimize something else.
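(The monk's actual benchmark isn't reproduced in this copy; here is a minimal sketch of the readline-versus-raw-sysread comparison it describes, with a hypothetical 1 MB chunk size:)

    use strict;
    use warnings;
    use Time::HiRes qw(time);

    my $path = shift or die "usage: $0 <file>\n";

    # Buffered, layered, line-at-a-time reading.
    my $t = time;
    open my $fh, '<:raw:perlio:utf8', $path or die "open: $!";
    my $lines = 0;
    $lines++ while <$fh>;
    close $fh;
    printf "readline: %d lines in %.3f s\n", $lines, time - $t;

    # Raw unbuffered chunks: no line splitting, no postprocessing.
    $t = time;
    open $fh, '<:raw', $path or die "open: $!";
    1 while sysread $fh, my $buf, 1 << 20;
    close $fh;
    printf "sysread : %.3f s (no postprocessing)\n", time - $t;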
Re: Perl's poor disk IO performance
by kikuchiyo (Hermit) on Apr 30, 2010 at 14:26 UTC
In my experience this can speed up processing. Unless, of course, your .gds files are so big that they don't fit in memory.
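(The snippet this reply refers to isn't shown in this copy; from the "fit in memory" caveat it appears to suggest slurping the whole file and parsing records from the in-memory scalar. A hypothetical sketch:)

    # Slurp the entire .gds file in one read, then walk the records with
    # unpack/substr as in the mmap example above.
    open my $fh, '<:raw', $path or die "open: $!";
    my $data = do { local $/; <$fh> };   # undef $/ disables line splitting
    close $fh;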