in reply to optimization in file processing

Reading terabytes of data simply takes time. I tried to read a 2 GByte file with this simple Perl script:

#!/usr/bin/perl
while (<>) { }

and it took 39 seconds. Scaled up to 1 terabyte (about 19.5 seconds per GB), this script would need roughly 5.4 hours.
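If you want to reproduce the measurement without relying on the shell, a minimal timing harness along these lines should work; the file name twogig.txt is just a placeholder, and Time::HiRes is a core module:

#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# time the same do-nothing read loop against a single file
my $t0 = [gettimeofday];
open my $fh, '<', 'twogig.txt' or die "open: $!";   # placeholder file name
while (<$fh>) { }                                   # read line by line, discard
close $fh;
printf "elapsed: %.1f seconds\n", tv_interval($t0);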

How much of that is Perl and how much is the hard disk? When I used

cat twogig.txt > /dev/null

it still took 25 seconds, which scales to about 3.5 hours for 1 terabyte. So in my case roughly two thirds of the time is spent just reading from disk; the rest can be attributed to Perl's overhead of reading line by line instead of in large chunks.
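If that line-by-line overhead turns out to matter, one way to cut it is to read the file in large fixed-size blocks instead. A minimal sketch; the 4 MB buffer size is an arbitrary choice, and a block can of course end in the middle of a line:

#!/usr/bin/perl
use strict;
use warnings;

# read the file in 4 MB blocks instead of line by line
open my $fh, '<', 'twogig.txt' or die "open: $!";   # placeholder file name
my $buf;
while (sysread($fh, $buf, 4 * 1024 * 1024)) {
    # process $buf here; remember that a block may end mid-line
}
close $fh;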

Do these tests yourself and you will get the lower limit of what you can hope to achieve without either throwing faster hardware at the problem or preprocessing the data (if the file doesn't change all the time, you could build the hash on disk once and reuse it across runs).
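For the "hash on disk" idea, one common approach is to tie the hash to a Berkeley DB file via DB_File, so it is built once and then reused in later runs without re-reading the source file. A rough sketch, assuming DB_File is installed and lookup.db is a made-up cache file name:

#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;
use DB_File;

# tie %lookup to a disk file; entries persist between runs
my %lookup;
tie %lookup, 'DB_File', 'lookup.db', O_RDWR|O_CREAT, 0666, $DB_HASH
    or die "tie: $!";

# first run: populate it from the big file; later runs: just query it
$lookup{'some key'} = 'some value';
print $lookup{'some key'}, "\n";

untie %lookup;

If the finished hash fits in memory, the core Storable module (store/retrieve) is a simpler alternative for saving and reloading it.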

Re^2: optimization in file processing
by moritz (Cardinal) on Jul 08, 2011 at 10:18 UTC

    While your analysis is very valuable, I think it also shows that the OP's speed problem likely doesn't stem from IO, at least if his machine behaves similarly to yours.

    10 lakhs = 1 million, so to be IO bound his record size would need to be on the order of 1 MB (1 TB divided by a million records is about 1 MB per record). I don't know what data the OP deals with, but typical CSV files have much smaller records.

      He said his file size is in the terabyte range (at least that's what I think his last sentence said). So either he has an atypical CSV file or one of his numbers is wrong.