in reply to optimization in file processing

Reading terabytes of data simply takes time. I tried to read a 2 GByte file with this simple Perl script:

#!/usr/bin/perl
while (<>) { }

and it took 39 seconds. Scaled up to 1 terabyte (about 19.5 seconds per GB), this script would need roughly 5.4 hours.
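If you want to reproduce the measurement without relying on the shell, a minimal timing harness along these lines should work; the file name twogig.txt is just a placeholder, and Time::HiRes is a core module:

#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# time the same do-nothing read loop against a single file
my $t0 = [gettimeofday];
open my $fh, '<', 'twogig.txt' or die "open: $!";   # placeholder file name
while (<$fh>) { }                                   # read line by line, discard
close $fh;
printf "elapsed: %.1f seconds\n", tv_interval($t0);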

How much of that is Perl and how much is the hard disk? When I used

cat twogig.txt > /dev/null

it still took 25 seconds, which scales to about 3.5 hours for 1 terabyte. So in my case roughly two thirds of the time is spent just reading from disk; the rest can be attributed to Perl's overhead of reading line by line instead of in large chunks.
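If that line-by-line overhead turns out to matter, one way to cut it is to read the file in large fixed-size blocks instead. A minimal sketch; the 4 MB buffer size is an arbitrary choice, and a block can of course end in the middle of a line:

#!/usr/bin/perl
use strict;
use warnings;

# read the file in 4 MB blocks instead of line by line
open my $fh, '<', 'twogig.txt' or die "open: $!";   # placeholder file name
my $buf;
while (sysread($fh, $buf, 4 * 1024 * 1024)) {
    # process $buf here; remember that a block may end mid-line
}
close $fh;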

Do these tests yourself and you will get the lower limit of what you can hope to achieve without either throwing faster hardware at the problem or preprocessing the data (if the file doesn't change all the time, you could build the hash on disk once and reuse it across runs).
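For the "hash on disk" idea, one common approach is to tie the hash to a Berkeley DB file via DB_File, so it is built once and then reused in later runs without re-reading the source file. A rough sketch, assuming DB_File is installed and lookup.db is a made-up cache file name:

#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;
use DB_File;

# tie %lookup to a disk file; entries persist between runs
my %lookup;
tie %lookup, 'DB_File', 'lookup.db', O_RDWR|O_CREAT, 0666, $DB_HASH
    or die "tie: $!";

# first run: populate it from the big file; later runs: just query it
$lookup{'some key'} = 'some value';
print $lookup{'some key'}, "\n";

untie %lookup;

If the finished hash fits in memory, the core Storable module (store/retrieve) is a simpler alternative for saving and reloading it.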

Re^2: optimization in file processing
by moritz (Cardinal) on Jul 08, 2011 at 10:18 UTC

    While your analysis is very valuable, I think it also shows that the OP's speed problem likely doesn't stem from IO, at least if his machine behaves similarly to yours.

    10 lakhs = 1 million, so to be IO bound his record size would need to be on the order of 1 MB (1 TB divided by a million records is about 1 MB per record). I don't know what data the OP deals with, but typical CSV files have much smaller records.

      He said his file size is in the terabyte range (at least that's what I think his last sentence said). So either he has an atypical CSV file or one of his numbers is wrong.