Hi all,

Thank you very much. I'll change the code to read line by line, and I'll check the ForkManager module as well. What I failed to mention is that, while only every 2nd line is processed, the other lines are not discarded. Let me explain:

Assume the file has n quadruples. Each group of 4 lines (1:4, 5:8, etc.) belongs together as one entity. For each group I check/edit the 2nd line. If it satisfies the criteria, I write all 4 lines (3 of them untouched, yes) to a separate output file; otherwise those 4 lines are left out of the output file.

Thanks again.

Re^3: buffering from a large file
by chrestomanci (Priest) on Mar 17, 2011 at 16:31 UTC

    Based on your further description, I would say that you should probably forget about ForkManager and multiple threads.

    If multiple threads are all trying to write to the same output file, then you will need to worry about locking the file so they don't corrupt it. The locking overhead will kill performance. Even if you solve that somehow, random differences in how long each thread takes to run will mean that the order of the lines in the output file gets partly randomised, which you probably don't want.

    Instead I suggest you go for a single-threaded solution that reads the input line by line and only keeps one group of four lines in memory at any one time. That way everything should be simple, and reasonably quick.
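
    As a rough sketch of that single-threaded approach (not tested against your data): the sub names `filter_quads` and `check_second_line` are made up for illustration, and the criterion shown (reject a second line containing "N") is only a placeholder for your real check/edit.

```perl
use strict;
use warnings;

# Hypothetical check/edit for the second line of a group.
# Returns the (possibly edited) line, or undef to reject the whole group.
# Placeholder criterion: reject any line containing "N".
sub check_second_line {
    my ($line) = @_;
    return $line =~ /N/ ? undef : $line;
}

# Read groups of four lines from $in and write the accepted groups to $out.
# Only one group is held in memory at a time.
# Returns the number of groups written.
sub filter_quads {
    my ($in, $out) = @_;
    my $kept = 0;
    until (eof($in)) {
        my @rec;
        for (1 .. 4) {
            my $line = <$in>;
            last unless defined $line;
            push @rec, $line;
        }
        last if @rec < 4;    # incomplete trailing group: skip it

        my $edited = check_second_line($rec[1]);
        next unless defined $edited;    # criteria not met: discard the group

        $rec[1] = $edited;
        print {$out} @rec;   # the other three lines go out untouched
        $kept++;
    }
    return $kept;
}
```

    To use it, open the input and output files with three-argument `open` and pass the two handles to `filter_quads`.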

    I suggest that you add some regular expressions or other tests when you read lines, so that if an extra newline creeps in somehow, the script can re-sync with the line groups and not break.
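
    One possible way to do the re-sync is sketched below. It rests on an assumption I cannot check from your description: that the first line of every group, and only the first line, starts with '@'. Swap in whatever pattern really identifies the start of a group. The name `next_record` is made up for this example.

```perl
use strict;
use warnings;

# Assumption: only the first line of a group matches this pattern.
my $HEADER_RE = qr/^\@/;

# Return the next complete group as an array ref of four lines,
# or undef at end of input.
sub next_record {
    my ($in) = @_;
    my @rec;
    while (defined(my $line = <$in>)) {
        if ($line =~ $HEADER_RE) {
            @rec = ($line);      # a header always starts a fresh group
        }
        elsif (@rec && $line =~ /\S/) {
            push @rec, $line;    # non-blank line inside a group
        }
        # Blank lines, and lines seen before any header, are skipped.
        return [@rec] if @rec == 4;
    }
    return;    # end of input; a partial group is dropped
}
```

    With this, a stray blank line is ignored, and a group that is cut short is abandoned as soon as the next header line turns up, so one bad group does not shift every group after it.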

    If you are really desperate for maximum performance then you could investigate chopping your raw file up into chunks and then having separate scripts process each. If you do that then the code you have written to find where a group of four lines starts will come in handy.
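
    If you do go down that road, the chunk boundaries have to fall on the start of a group. A minimal sketch of finding such a boundary is below, under the same unverified assumption that only the first line of a group starts with '@'. The sub name `record_start_at_or_after` is invented for illustration, and `$offset` is assumed to lie within the file.

```perl
use strict;
use warnings;

# Assumption: only the first line of a group matches this pattern.
my $HEADER_RE = qr/^\@/;

# Return the byte offset of the first group that starts at or after $offset.
# If there is no such group, the offset of the end of the file is returned.
sub record_start_at_or_after {
    my ($fh, $offset) = @_;
    if ($offset > 0) {
        # Step back one byte and read to the end of that line, so that
        # the next read begins at the start of a line.
        seek($fh, $offset - 1, 0) or die "seek failed: $!";
        my $partial = <$fh>;
    }
    else {
        seek($fh, 0, 0) or die "seek failed: $!";
    }
    while (1) {
        my $pos  = tell($fh);
        my $line = <$fh>;
        return $pos if !defined $line || $line =~ $HEADER_RE;
    }
}
```

    Each script would then process the groups from its own boundary up to, but not including, the next script's boundary, and write to its own output file so that no locking is needed. The output files can be concatenated in order afterwards.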

      Chrestomanci, you're right. It's probably not so wise to do it here. I recoded it to read the file line by line and it's already really fast. However, I would like to learn about threads and forks in Perl. Could you point me to a good source? Thanks,

        Threads and forks are both fairly common topics here at the Monastery, so if you hang out here or search you will read lots of informative stuff about both topics. I posted a longer example on how to use ForkManager a couple of months ago. Many other wise brothers have done the same.

        I am much less familiar with threads, but I am sure there is plenty of good documentation out there. Again, you should search the Monastery archives, or use Google.

        If you are still in need of enlightenment, then by all means ask another question, as it is the path to wisdom in this place.