Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks

My concern is this: I have a file with lakhs (hundreds of thousands) of records, each field separated by a semicolon. I need to parse each record, split out the fields, and do some calculation; depending on which condition a record satisfies, the result has to be saved into different output files.

I also need to aggregate certain fields across records that satisfy a given condition. For this I build a hash while reading, and at the end of the file I do the aggregation and write it out to a file.

With this process, 10 lakh records take about 3 hours, so I need to optimize it. What I can't figure out is where the time goes: is it reading the terabyte-sized file line by line that takes so long, or is it holding the content in memory (the hash) and writing it out to a file at the end?
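Roughly, the processing looks like the sketch below (the field positions, the condition, and the file names are placeholders, not my real code):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %totals;    # aggregation, keyed by an (assumed) key field

    open my $in,    '<', 'input.txt'    or die "Can't open input.txt: $!";
    open my $match, '>', 'matched.txt'  or die "Can't open matched.txt: $!";
    open my $rest,  '>', 'others.txt'   or die "Can't open others.txt: $!";

    while (my $line = <$in>) {
        chomp $line;
        my @fields = split /;/, $line;

        # placeholder calculation and condition
        my $value = $fields[2] * $fields[3];
        if ($value > 100) {
            print {$match} join(';', @fields, $value), "\n";
        }
        else {
            print {$rest} join(';', @fields, $value), "\n";
        }

        # collect per-key totals for the aggregation at the end
        $totals{ $fields[0] } += $value;
    }

    # write the aggregation once, after the whole file has been read
    open my $agg, '>', 'aggregate.txt' or die "Can't open aggregate.txt: $!";
    print {$agg} "$_;$totals{$_}\n" for sort keys %totals;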

Re: optimization in file processing
by jethro (Monsignor) on Jul 07, 2011 at 12:50 UTC

    Reading terabytes of file data takes time. I tried reading a 2 GByte file with this simple Perl script:

     #!/usr/bin/perl
     while (<>) { }

    and it took 39 seconds. Scaled up to 1 terabyte (about 500 times that size), this script would need roughly 500 × 39 s, i.e. around 5.4 hours.

    How much of that is Perl and how much is the hard disk? When I used

    cat twogig.txt > /dev/null

    it still took 25 seconds. Scaled to 1 terabyte that is roughly 3.5 hours. So in my case about two thirds of the time is spent just reading from disk; the rest can be attributed to not reading in large chunks, i.e. the overhead of reading line by line.
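    If the per-line overhead turns out to matter, one thing to try is reading in big blocks and splitting the lines yourself. An untested sketch (the block size and file name are arbitrary):

        #!/usr/bin/perl
        use strict;
        use warnings;

        open my $fh, '<', 'twogig.txt' or die "Can't open twogig.txt: $!";

        my $tail = '';
        while (sysread($fh, my $block, 8 * 1024 * 1024)) {    # 8 MB blocks
            $block = $tail . $block;
            my @lines = split /\n/, $block, -1;
            $tail = pop @lines;        # last element may be an incomplete line
            for my $line (@lines) {
                # process $line here
            }
        }
        # don't forget the final partial line left in $tail, if any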

    Do these tests yourself and you will get the lower limit of what you can hope to achieve without either throwing faster hardware at the problem or preprocessing the data (if the file doesn't change all the time, you might construct the hash on disk once and reuse it).
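    For the preprocessing idea, the core Storable module is one way to keep the hash on disk between runs. Another untested sketch (the cache file name and build_totals() are made up for illustration):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Storable qw(store retrieve);

        my $cache = 'totals.storable';

        my $totals;
        if (-e $cache) {
            $totals = retrieve($cache);             # fast reload of the saved hash
        }
        else {
            $totals = build_totals('input.txt');    # expensive one-time pass
            store($totals, $cache);                 # persist the hashref for later runs
        }

        sub build_totals {
            my ($file) = @_;
            my %totals;
            open my $fh, '<', $file or die "Can't open $file: $!";
            while (<$fh>) {
                chomp;
                my @f = split /;/;
                $totals{ $f[0] } += $f[1];          # assumed key and value columns
            }
            return \%totals;
        }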

      While your consideration is very valuable, I think it also shows that the OP's speed problem likely doesn't stem from I/O, at least if his machine behaves similarly to yours.

      10 lakh = 1 million, so to be I/O bound his record size would need to be on the order of 1 MB. I don't know what data the OP deals with, but typical CSV files have much smaller records.

        He said his file size is in the terabyte range (at least that's what I think his last sentence said). So either he has an atypical CSV file or one of his numbers is wrong.
Re: optimization in file processing
by moritz (Cardinal) on Jul 07, 2011 at 12:08 UTC
Re: optimization in file processing
by BrowserUk (Patriarch) on Jul 08, 2011 at 12:31 UTC

    You are far more likely to get useful answers if you post your script so we can see exactly what you are doing.