Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks

My concern is this: I have a file with lakhs (hundreds of thousands) of records, each field separated by a semicolon. I need to parse each record, split out the fields, and do some calculation; depending on which condition a record satisfies, the result has to be saved into different output files.

I also need to aggregate certain fields across records that satisfy a given condition. For this I build a hash while reading, and at the end of the file I do the aggregation and write it out to a file.

With this process, 10 lakh records take about 3 hours, so I need to optimize it. What I can't figure out is where the time goes: is it reading the terabyte-sized file line by line that takes so long, or is it holding the content in memory (the hash) and writing it out to a file at the end?
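Roughly, the processing looks like the sketch below (the field positions, the condition, and the file names are placeholders, not my real code):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %totals;    # aggregation, keyed by an (assumed) key field

    open my $in,    '<', 'input.txt'    or die "Can't open input.txt: $!";
    open my $match, '>', 'matched.txt'  or die "Can't open matched.txt: $!";
    open my $rest,  '>', 'others.txt'   or die "Can't open others.txt: $!";

    while (my $line = <$in>) {
        chomp $line;
        my @fields = split /;/, $line;

        # placeholder calculation and condition
        my $value = $fields[2] * $fields[3];
        if ($value > 100) {
            print {$match} join(';', @fields, $value), "\n";
        }
        else {
            print {$rest} join(';', @fields, $value), "\n";
        }

        # collect per-key totals for the aggregation at the end
        $totals{ $fields[0] } += $value;
    }

    # write the aggregation once, after the whole file has been read
    open my $agg, '>', 'aggregate.txt' or die "Can't open aggregate.txt: $!";
    print {$agg} "$_;$totals{$_}\n" for sort keys %totals;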

Re: optimization in file processing
by jethro (Monsignor) on Jul 07, 2011 at 12:50 UTC

    Reading terabytes of file data takes time. I tried reading a 2 GByte file with this simple Perl script:

     #!/usr/bin/perl
     while (<>) { }

    and it took 39 seconds. Scaled up to 1 terabyte (about 500 times that size), this script would need roughly 500 × 39 s, i.e. around 5.4 hours.

    How much of that is Perl and how much is the hard disk? When I used

    cat twogig.txt > /dev/null

    it still took 25 seconds. Scaled to 1 terabyte that is roughly 3.5 hours. So in my case about two thirds of the time is spent just reading from disk; the rest can be attributed to not reading in large chunks, i.e. the overhead of reading line by line.
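    If the per-line overhead turns out to matter, one thing to try is reading in big blocks and splitting the lines yourself. An untested sketch (the block size and file name are arbitrary):

        #!/usr/bin/perl
        use strict;
        use warnings;

        open my $fh, '<', 'twogig.txt' or die "Can't open twogig.txt: $!";

        my $tail = '';
        while (sysread($fh, my $block, 8 * 1024 * 1024)) {    # 8 MB blocks
            $block = $tail . $block;
            my @lines = split /\n/, $block, -1;
            $tail = pop @lines;        # last element may be an incomplete line
            for my $line (@lines) {
                # process $line here
            }
        }
        # don't forget the final partial line left in $tail, if any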

    Do these tests yourself and you will get the lower limit of what you can hope to achieve without either throwing faster hardware at the problem or preprocessing the data (if the file doesn't change all the time, you might construct the hash on disk once and reuse it).
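    For the preprocessing idea, the core Storable module is one way to keep the hash on disk between runs. Another untested sketch (the cache file name and build_totals() are made up for illustration):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Storable qw(store retrieve);

        my $cache = 'totals.storable';

        my $totals;
        if (-e $cache) {
            $totals = retrieve($cache);             # fast reload of the saved hash
        }
        else {
            $totals = build_totals('input.txt');    # expensive one-time pass
            store($totals, $cache);                 # persist the hashref for later runs
        }

        sub build_totals {
            my ($file) = @_;
            my %totals;
            open my $fh, '<', $file or die "Can't open $file: $!";
            while (<$fh>) {
                chomp;
                my @f = split /;/;
                $totals{ $f[0] } += $f[1];          # assumed key and value columns
            }
            return \%totals;
        }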

      While your consideration is very valuable, I think it also shows that the OP's speed problem likely doesn't stem from I/O, at least if his machine behaves similarly to yours.

      10 lakh = 1 million, so to be I/O bound his record size would need to be on the order of 1 MB. I don't know what data the OP deals with, but typical CSV files have much smaller records.

        He said his file size is in the terabyte range (at least that's what I think his last sentence said). So either he has an atypical CSV file or one of his numbers is wrong.
Re: optimization in file processing
by moritz (Cardinal) on Jul 07, 2011 at 12:08 UTC
Re: optimization in file processing
by BrowserUk (Patriarch) on Jul 08, 2011 at 12:31 UTC

    You are far more likely to get useful answers if you post your script so we can see exactly what you are doing.