in reply to Re: File reading efficiency and other surly remarks
in thread File reading efficiency and other surly remarks

Note: per perlstyle, the open statement should include the filename in the die message when it fails. Also, there are enough levels of buffering involved that I don't think worrying about an "optimal block size" really makes sense. And finally, just letting Perl handle the line-by-line reading is probably faster and more reliable IMO; it does that buffering behind the scenes for you.
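For illustration, a minimal sketch of that style (the filename variable and the loop body are mine, not from the original code):

    my $file = "whatever.txt";    # hypothetical filename
    open(FILE, $file) or die "Can't open $file: $!";
    while (<FILE>) {              # let Perl buffer and split the lines
        chomp;
        # process $_ here
    }
    close(FILE);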

OTOH I have used similar code when working with binary data. So the general technique is good to know.
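As a hedged sketch of that chunk technique applied to binary data (the filename and the 64K block size are just examples):

    my $file = "some.bin";                    # hypothetical binary file
    open(BIN, $file) or die "Can't open $file: $!";
    binmode(BIN);                             # no newline translation
    my $buf;
    while (my $got = read(BIN, $buf, 64 * 1024)) {
        # work on the $got bytes now in $buf; no line splitting needed
    }
    close(BIN);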


Replies are listed 'Best First'.
RE: RE (tilly) 2 (blame): File reading efficiency and other surly remarks
by tye (Sage) on Aug 26, 2000 at 08:10 UTC

    Good point. Since I also mentioned reading chunks at a time, I'll emphasize that this is not a good idea if you are going to split each chunk into lines.

    When you use Perl's <FILE>, Perl itself is reading the file as chunks and splitting them into lines to give to you. I can assure you that you can't do this faster in Perl code than Perl can do it itself. And the Perl code has been tested a lot more than any similar code you might write.

    Yes, Tilly already said all of this. I just didn't think he said it strongly enough (and I felt guilty for suggesting chunk-by-chunk after not fully understanding a previous reply).

            - tye (but my friends call me "Tye")
      I have done some benchmarking of "line at a time" vs. "chunk at a time with manual split into lines" vs. "line at a time with lots of buffering". "Chunk at a time with manual split into lines" is clearly the fastest, by almost 2 to 1 over the other two methods. I've included my results and benchmarking program below:
      Benchmark: running BufferedFileHandle, chunk, linebyline, each for at least 3 CPU seconds...
      BufferedFileHandle:  3 wallclock secs ( 3.22 usr + 0.08 sys =  3.30 CPU) @ 2.73/s (n=9)
                   chunk:  4 wallclock secs ( 2.89 usr + 0.32 sys =  3.21 CPU) @ 4.36/s (n=14)
              linebyline:  4 wallclock secs ( 3.25 usr + 0.06 sys =  3.31 CPU) @ 2.72/s (n=9)
      #!/usr/bin/perl
      use Benchmark;
      use strict;
      use FileHandle;

      timethese(0, {
          'linebyline'         => \&linebyline,
          'chunk'              => \&chunk,
          'BufferedFileHandle' => \&BufferedFileHandle,
      });

      sub linebyline {
          open(FILE, "file");
          while (<FILE>) { }
          close(FILE);
      }

      sub chunk {
          my ($buf, $leftover, @lines);
          open(FILE, "file");
          while (read FILE, $buf, 64*1024) {
              $buf = $leftover . $buf;
              @lines = split(/\n/, $buf);
              $leftover = ($buf !~ /\n$/) ? pop @lines : "";
              foreach (@lines) { }
          }
          close(FILE);
      }

      sub BufferedFileHandle {
          my $fh = new FileHandle;
          my $buffer_var;
          $fh->open("file");
          $fh->setvbuf($buffer_var, _IOLBF, 64*1024);
          while (<$fh>) { }
          $fh->close;
      }
      I'd be very interested to see your results that show differently.

      Edit: replaced CODE tags with PRE tags around long lines.

        While demonstrating that one point is wrong (and again making it clear that until you benchmark, you don't really know what is faster), you demonstrate the other.

        What happens in your chunk code with the last line? Which approach is more code? And once you are done fixing that, you may still be twice as fast, but with quite a bit more (and harder to read) code. Going forward, that is more to maintain.
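        One possible patch, sketched with the variable names from the benchmark above (only the end of the sub changes; this is my suggestion, not code from the original post):

            while (read FILE, $buf, 64*1024) {
                $buf = $leftover . $buf;
                @lines = split(/\n/, $buf);
                $leftover = ($buf !~ /\n$/) ? pop @lines : "";
                foreach (@lines) { }      # process each complete line
            }
            # A file that does not end in "\n" leaves its final line in
            # $leftover; process it here or it is silently dropped.
            if (length $leftover) {
                # process $leftover as the last line
            }
            close(FILE);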

        I would strongly argue against this optimization (which I think might well give different results on different operating systems) until after your system is built and performance is known to be a problem.

        One note though. The IO* modules generally have significant overhead and I don't recommend using them.

        EDIT
        Another bug. You used split in the chunk method without the third (limit) argument, so trailing empty fields are thrown away. Should your block boundary land at the start of a paragraph (i.e. on a run of blank lines), you would incorrectly lose lines!
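        A quick illustration with made-up data (not from the benchmark) of what the missing limit argument changes:

            my $buf  = "one\ntwo\n\n\n";          # chunk ending in blank lines
            my @bad  = split(/\n/, $buf);         # ("one", "two")  -- trailing blanks dropped
            my @good = split(/\n/, $buf, -1);     # ("one", "two", "", "", "")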

        I'd be very interested to see your results that show differently.

        Cut and paste the code above, copy Chatter.bat (33KB) to "file", and run it:

        Benchmark: running BufferedFileHandle, chunk, linebyline, each for at least 3 CPU seconds...
        BufferedFileHandle:  4 wallclock secs ( 3.46 usr + 0.00 sys =  3.46 CPU) @ 386.13/s (n=1336)
                     chunk:  4 wallclock secs ( 3.63 usr + 0.00 sys =  3.63 CPU) @ 310.19/s (n=1126)
                linebyline:  4 wallclock secs ( 3.40 usr + 0.00 sys =  3.40 CPU) @ 434.71/s (n=1478)

        This shows that default line-by-line is the fastest (434/s), enlarged buffer line-by-line is the 2nd fastest (386/s), and chunk and split is the slowest (310/s).

        Now append Chatter.bat to "file" until it is about 1MB. That gives buffered@15/s, line-by-line@13/s, chunk@9/s.

        Find an 85MB file and repeat: buffered@0.20/s, line-by-line@0.19/s, chunk@0.12/s.
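        For anyone who wants to reproduce the larger runs, a throwaway sketch of growing the test file by repeated appending (my own code; the point is just to keep appending until the target size is reached):

            #!/usr/bin/perl
            use strict;
            my $seed = "Chatter.bat";                 # the 33KB seed file
            local $/;                                 # slurp mode
            open(SEED, $seed) or die "Can't open $seed: $!";
            my $data = <SEED>;
            close(SEED);
            open(OUT, ">>file") or die "Can't append to file: $!";
            print OUT $data for 1 .. 32;              # 32 x 33KB is roughly 1MB
            close(OUT);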

        I'd personally consider perl broken if it couldn't read a line at a time faster than I could in Perl code. Previous benchmarks have shown that Perl's overriding of stdio buffers can make perl's I/O faster than I/O in C programs using stdio. So I must be missing something about (at least) your copy of perl to understand why standard line-by-line wasn't faster for you.

        Update: I removed a pointless sentence that was probably snide. I apologize to those who already read it.

                - tye (but my friends call me "Tye")