It is now 4:11 am here. Let me make it crystal clear: I am not back for a heated discussion. I am back to clear the water and to demo why slurping a file is not a good practice. The moment I got the idea, I felt obliged to test it out and share the result with the monks.
We have all read the benchmark from demerphq. His data clearly showed that slurping a file by reading it in as one string and then splitting it is much faster than all the other solutions.
But what does that mean? I had been thinking about it ever since I saw his result. I couldn't really get it until the moment I told myself: yes, it only means that the reading/slurping ITSELF is fast, but holding the whole file in memory can seriously slow down the application as a whole, and turn the speed gained at slurping time into not just nothing, but something that will bite us. So I decided to design a test case to demonstrate this.
All I need to do is cause PAGING. For that I don't need a huge file; it only needs to be big enough that, together with the other parts of the application, it causes PAGING.
This time I decided to come back with SOLID DATA, not just blah blah like last time... ;-)
I first wrote this little script to prepare my test data:
test_pre.pl:
use strict;

# write one number per line; the line count comes from the command line
open(my $out, ">", "test.dat") or die "Cannot open test.dat: $!";
foreach (1 .. $ARGV[0]) {
    print $out "$_\n";
}
close($out);
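(Just as a convenience, the same data could also be generated with a one-liner along these lines; I used the script above for my runs.)

perl -le "print for 1..100000" > test.dat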
Then I wrote two simple programs: one slurps the file with split (test_slurp.pl) and then DOES SOMETHING, while the other reads the file in line by line (test.pl) and then DOES EXACTLY THE SAME THING. The whole point is to DO SOMETHING, not just read; one would NEVER read a file without using it. Also, this SOMETHING has to be simple, straightforward, and something that could happen EVERY day:
test_slurp.pl:
use strict;
use constant SIZE => 10000;

my $t0    = time();
my $lines = read_log_s();           # slurp the whole file up front
my @data  = (0 .. SIZE - 1);

# DO SOMETHING with every line: a trivial lookup-and-adjust pass
foreach my $line (@{$lines}) {
    $line += ($data[$line % SIZE] - $data[($line + 1) % SIZE]);
}
print "Used ", time() - $t0, "\n";

# read the whole file as one string, then split it into lines
sub read_log_s {
    local $/;                       # undef $/ = slurp mode
    open(my $fh, "<", "test.dat") or die "Cannot open test.dat: $!";
    my @lines = split /\n/, <$fh>;
    close($fh);
    return \@lines;
}
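(By the way, reading all lines straight into an array is another common slurp style; a minimal sketch of what read_log_s could look like that way is below. It avoids the explicit split, but the whole file still ends up in memory at once, so the same PAGING argument applies. This is only for illustration, it is not one of the programs I timed.)

sub read_log_array {
    open(my $fh, "<", "test.dat") or die "Cannot open test.dat: $!";
    my @lines = <$fh>;    # list context reads every line of the file at once
    close($fh);
    return \@lines;       # the whole file still lives in memory
}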
test.pl:
use strict;
use constant SIZE => 10000;

my $t0   = time();
my @data = (0 .. SIZE - 1);

# read the file line by line and DO EXACTLY THE SAME THING with each line
open(my $fh, "<", "test.dat") or die "Cannot open test.dat: $!";
while (my $line = <$fh>) {
    $line += ($data[$line % SIZE] - $data[($line + 1) % SIZE]);
}
close($fh);
print "Used ", time() - $t0, "\n";
I first ran
perl -w test_pre.pl 100000
which created a test.dat of only 688K. Then I ran both test_slurp.pl and test.pl against this file on my Win98 box: test_slurp.pl used 9 seconds, while test.pl used only 2. (Well, my PC is slow, sorry about that ;-)
The test is nothing complex; as you can see, it is really simple. The only idea is to DO SOMETHING. And now we can see clearly: even though the slurping itself was much faster, it easily caused the application as a whole to slow down significantly.
(If you want to repeat my test, you may need to adjust the numbers to cause PAGING, depending on your PC configuration and OS.)
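(If you want to see how big the slurped data actually gets in memory on your machine, which is what decides whether you hit PAGING, here is a rough sketch. It assumes the Devel::Size module from CPAN is installed; it is not part of the timed tests.)

use strict;
use Devel::Size qw(total_size);

# same slurp-and-split idiom as test_slurp.pl
open(my $fh, "<", "test.dat") or die "Cannot open test.dat: $!";
local $/;                                  # slurp mode
my @lines = split /\n/, <$fh>;
close($fh);

# counts the array plus every scalar in it -- usually far more
# than the raw file size on disk
print "In-memory size: ", total_size(\@lines), " bytes\n";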
...I have to go back to sleep ;-)
I also tested with perl -w test_pre.pl 1000000 (file size just under 8 MB): test.pl used 30 seconds, while test_slurp.pl dumped core after running for a while (I tried twice). Now I am really leaving... ;-)