kwaping has asked for the wisdom of the Perl Monks concerning the following question:

This node was inspired by Re^2: file reading issues.

I recently ran a simple test of the speed of these two blocks of code:
open(FILE,"<$file") || die $!; read FILE, my $data, -s $file; close(FILE);
open(FILE,"<$file") || die $!; my $data = do { local $/; <FILE> }; close(FILE);
I expected the second method (a traditional slurp) to be faster. However, I was surprised to find that the read function was almost twice as fast!

Can anyone explain why?

Here's the simple test I ran:
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);

$| = 1;

my $file = '/path/to/file.pdf';
my $numtests = 1000;

print "using read function\n";
for (my $x = 0; $x < 6; $x++) {
    my $start = time();
    for (my $i = 0; $i <= $numtests; $i++) {
        open(FILE,"<$file") || die $!;
        read FILE, my $data, -s $file;
        close(FILE);
    }
    my $end = time();
    print $end - $start,"\n";
}

print "\ntraditional slurp\n";
for (my $x = 0; $x < 6; $x++) {
    my $start = time();
    for (my $i = 0; $i <= $numtests; $i++) {
        open(FILE,"<$file") || die $!;
        my $data = do { local $/; <FILE> };
        close(FILE);
    }
    my $end = time();
    print $end - $start,"\n";
}
# exit; #<- removed for blazar ;)

Here's the output:
using read function
0.612063884735107
0.62070107460022
0.599463939666748
0.60235595703125
0.610018014907837
0.603386878967285

traditional slurp
1.03461217880249
1.0298318862915
1.05549097061157
1.08192896842957
1.02464509010315
1.0180230140686

Re: Speed reading (files)
by BrowserUk (Patriarch) on Aug 04, 2005 at 16:06 UTC

    With read, you are telling it how big the file is, so it can preallocate a buffer to that size and fill it in a single call to the system (even if the system chooses to break it into smaller reads from disk). So, one call to the system to allocate the memory. One call to perform the read.

    With the traditional slurp, it doesn't know how big the file is, so it

    1. allocates a buffer (probably 4k),
    2. calls the system to read 4k,
    3. checks to see if 4k was read,
      • if not, finished.
      • if so, reallocates the buffer to +4k (which probably means the first 4k gets copied).
    4. repeat from step 2.

    Try using:

    my $data = do { local $/ = \( -s FILE ); <FILE> };
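
    For context, a rough and untested sketch of how that drops into the benchmark loop from the root node (same $file and FILE handle as in the original test script). Setting $/ to a reference to a number makes <FILE> read fixed-size records, so a single "record" here is the whole file:

    # untested sketch; FILE/$file as in the original test script
    open(FILE,"<$file") || die $!;
    my $data = do { local $/ = \( -s FILE ); <FILE> };  # one record == entire file
    close(FILE);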

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
Re: Speed reading (files)
by blazar (Canon) on Aug 04, 2005 at 15:38 UTC
    Not that this has ever mattered much in most practical situations I've faced myself (so I most commonly stick with local $/;), but I have often seen it mentioned that File::Slurp provides much-optimized file slurping facilities.
      I too have read about File::Slurp and how it's supposed to be super fast. However, check this out:
      using read function
      0.684900999069214
      0.681570053100586
      0.680304050445557
      0.675194025039673
      0.684563159942627
      0.686581134796143

      File::Slurp
      1.57559299468994
      1.5706889629364
      1.5739688873291
      1.5618691444397
      1.56290698051453
      1.58691692352295
      That's using the same test as above, replacing
      open(FILE,"<$file") || die $!; my $data = do { local $/; <FILE> }; close(FILE);
      with
      my $data = read_file($file);
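
      In other words, the timed inner loop becomes (a rough, untested sketch, relying on File::Slurp exporting read_file by default):

      use File::Slurp;  # exports read_file() by default

      for (my $i = 0; $i <= $numtests; $i++) {
          my $data = read_file($file);  # open, slurp and close all happen inside read_file
      }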

        Have you seen the article in the File::Slurp distribution? At its core, it's doing something very similar to the "read" approach. I'd guess the speed difference you're seeing is due to the option/error/context checking that it's doing, which your hand-coded example is not.

        -xdg

        Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

Re: Speed reading (files)
by izut (Chaplain) on Aug 04, 2005 at 16:25 UTC
    I did this test:
    use Benchmark qw(:all);
    use File::Slurp;

    $file_slurp = sub {
        $file = read_file('arq.pdf', binmode => ':raw');
    };
    $read = sub {
        open $file, '<', 'arq.pdf' or die $!;
        read $file, $data, -s 'arq.pdf';
        close $file;
    };

    timethese(1000, { file_slurp => $file_slurp, read => $read });
    cmpthese(1000, { file_slurp => $file_slurp, read => $read });
    I got this results:
    Benchmark: timing 1000 iterations of file_slurp, read...
    file_slurp:  1 wallclock secs ( 0.43 usr +  0.31 sys =  0.74 CPU) @ 1351.35/s (n=1000)
          read:  6 wallclock secs ( 4.88 usr +  0.44 sys =  5.32 CPU) @ 187.97/s (n=1000)
                  Rate       read file_slurp
    read         192/s         --       -86%
    file_slurp  1333/s       595%         --
    I see that file_slurp is the best option in this case. I used an 80k file with 1000 lines and 79 columns.

    Using a 2.5MB PDF file, the results were different:

    file_slurp: 44 wallclock secs (15.99 usr + 26.40 sys = 42.39 CPU) @ 23.59/s (n=1000)
          read: 23 wallclock secs ( 9.12 usr + 13.67 sys = 22.79 CPU) @ 43.88/s (n=1000)
                 Rate file_slurp  read
    file_slurp 23.6/s         --  -46%
    read       43.6/s        85%    --
    In this case File::Slurp was slower than read(). I see that File::Slurp has a binmode option. After changing to binmode:
    file_slurp: 46 wallclock secs (16.01 usr + 26.25 sys = 42.26 CPU) @ 23.66/s (n=1000)
          read: 25 wallclock secs ( 9.00 usr + 13.91 sys = 22.91 CPU) @ 43.65/s (n=1000)
                 Rate file_slurp  read
    file_slurp 23.7/s         --  -46%
    read       43.5/s        84%    --
    I think with bigger files read() has better performance than File::Slurp... I don't know why, but I hope it helps :-)


    Igor S. Lopes - izut
    surrender to perl. your code, your rules.
Re: Speed reading (files)
by anonymized user 468275 (Curate) on Aug 04, 2005 at 15:52 UTC
    The slurp is not as ravenous as its nickname suggests. Although it appears, even under the debugger, to slurp in one go, behind the scenes the <> operator is performing I/O system calls that are consistent with reading by delimiter, and that in spite of your local $/. read's buffers, on the other hand, are of an optimised size. As a rule, the larger the buffer of each actual I/O system call, the fewer such calls have to be performed and the more efficiently the whole file is read in.
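
    As a rough, untested illustration of that last point (the 1MB chunk size is an arbitrary example value, not anything the original code uses), slurping via a few large sysread calls rather than many delimiter-driven reads looks like:

    open my $fh, '<', $file or die $!;
    my $data = '';
    while (sysread($fh, my $chunk, 1024 * 1024)) {  # each iteration is one large read system call
        $data .= $chunk;
    }
    close $fh;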

    One world, one people

Re: Speed reading (files)
by polettix (Vicar) on Aug 04, 2005 at 17:10 UTC
    The repetition of the test makes it highly probable that the file is cached in RAM. On one side, this means that the results you get are probably indicative of the real performance of the software portion of the reading process, i.e. the comparison is meaningful to some extent. OTOH, real-world situations in which you have to read gigabytes of data will probably suffer a bottleneck from the actual reading from the device, so the differences would probably disappear.

    The interesting thing is in your answer to the node you pointed to. You do this because that's how you learned to do it - because that's how you see it done almost everywhere (apart from the use of File::Slurp). This is no cargo cult IMHO, but a useful Perl-level idiom that says "slurp that file"; to this extent, using read would be way too "low-level"!

    Flavio
    perl -ple'$_=reverse' <<<ti.xittelop@oivalf

    Don't fool yourself.
Re: Speed reading (files)
by ikegami (Patriarch) on Aug 04, 2005 at 17:39 UTC
    IIRC,
    open(FILE,"<$file") || die $!; my $data = do { local $/; <FILE> }; close(FILE);
    is slower than
    open(FILE,"<$file") || die $!; my $data; { local $/; $data = <FILE> } close(FILE);
    which is the true traditional slurp.
      Read is still faster, but the difference is very minute. Using read is still no worse than the true traditional slurp, which I find interesting.
      using read function
      0.614683866500854
      0.604650020599365
      0.616497039794922
      0.640319108963013
      0.603079795837402
      0.605829000473022

      ikegami's code
      0.672801971435547
      0.667882919311523
      0.666944980621338
      0.670393943786621
      0.712579011917114
      0.767023086547852

        The advantage of "slurping" is that it works with ttys. I don't think -s does.

        Also, is read guaranteed to return the number of bytes requested (if the file is big enough)? sysread isn't.
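
        (With sysread the usual defensive pattern is to loop until the requested count has accumulated; a rough, untested sketch, assuming $fh is already open on $file:)

        my $want = -s $file;
        my $data = '';
        while (length($data) < $want) {
            my $got = sysread($fh, $data, $want - length($data), length($data));
            die "sysread: $!" unless defined $got;  # undef signals a read error
            last unless $got;                       # 0 means EOF before $want bytes arrived
        }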

        As an aside, wouldn't read FILE, my $data, $MAX_FILE_SIZE; be faster than using -s, especially on smaller files? I guess you have to find an acceptable $MAX_FILE_SIZE (in Config.pm, maybe???)

        By the way, I'm not surprised that read is as fast or faster than "slurping". Why wouldn't it be?