in reply to how to split huge file reading into multiple threads

My intuitive guess, based on your task description, is that your algorithm is probably memory-based, and that what is actually happening is classic “thrashing.” In that case, threads won’t help at all.

Consider ways to use disk-based sorting to manage the files.   Or, put the data into an SQLite database (disk file...) and use its indexing and querying capability.   The bottom line is ... don’t do anything “in memory.”   That means:   no hashes, no lists, no “potentially big things in memory” at all.
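A minimal sketch of the disk-based idea, assuming (the original question does not say) that the input is a text file with one record per line and the key in the first comma-separated field. The function names and the file layout are hypothetical. sort(1) does an external merge sort, spilling sorted runs to temporary files instead of holding the whole input in RAM, and the follow-up pass then only ever needs the current key in memory:

```shell
#!/bin/sh
# Hypothetical layout: one record per line, key in field 1, comma-separated.

sort_by_key() {
    # $1 = input file, $2 = output file, $3 = directory for temporary runs.
    # sort(1) writes intermediate sorted runs under -T and merges them,
    # so memory use stays bounded even for very large inputs.
    LC_ALL=C sort -t, -k1,1 -T "$3" -o "$2" "$1"
}

count_per_key() {
    # $1 = file already sorted by key.
    # Streams the file line by line; only the current key and its running
    # count are held in memory, never a hash of all keys.
    awk -F, '
        NR > 1 && $1 != key { print key "," n; n = 0 }
        { key = $1; n++ }
        END { if (NR > 0) print key "," n }
    ' "$1"
}
```

Usage would look something like `sort_by_key records.csv sorted.csv /var/tmp && count_per_key sorted.csv`, where `records.csv` stands in for the real input. The same sort-then-stream pattern carries over to Perl: sort on disk first, then read the sorted file one line at a time.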

An appropriate redesign should not blink at all at “millions of records.” But we do know that the classic performance curve caused by thrashing shows degradation that is not linear but exponential. When you say “2+ hours,” that is what it fairly screams to me.

Easy test:   fire up the program and use a separate system monitor to watch the swap I/O rate, and the percentage of time spent in page faults.   If it is, as I suspect it will be, “huge,” then there’s your answer.
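On Linux, `vmstat 5` is one standard way to run this test: its `si` and `so` columns report swap-in and swap-out activity per interval. As a rough, Linux-only sketch of the same measurement, the functions below read the cumulative `pswpin` and `pswpout` page counters from `/proc/vmstat` and print how much they moved between samples. The names `swap_counters` and `watch_swap` are made up for illustration:

```shell
#!/bin/sh
# Linux-only sketch: /proc/vmstat holds cumulative counts of pages
# swapped in (pswpin) and swapped out (pswpout) since boot.

swap_counters() {
    # Print "pswpin pswpout" from a vmstat-format file.
    # $1 = file to read (defaults to /proc/vmstat).
    awk '$1 == "pswpin"  { i = $2 }
         $1 == "pswpout" { o = $2 }
         END { print i + 0, o + 0 }' "${1:-/proc/vmstat}"
}

watch_swap() {
    # $1 = seconds between samples (default 5).
    # Prints the number of pages swapped in and out during each interval.
    # Runs until interrupted with Ctrl-C.
    interval=${1:-5}
    set -- $(swap_counters)
    while sleep "$interval"; do
        prev_in=$1 prev_out=$2
        set -- $(swap_counters)
        echo "swapped in: $(( $1 - prev_in )) pages, out: $(( $2 - prev_out )) pages"
    done
}
```

Run `watch_swap 5` in a second terminal while the program is working. Numbers that stay at or near zero mean the machine is not swapping; large, sustained numbers point to thrashing.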

Re^2: how to split huge file reading into multiple threads
by onelesd (Pilgrim) on Aug 23, 2011 at 17:54 UTC
    If you are on Linux, try running iostat -mx 5 while your non-threaded Perl script is processing the input. It will show you where the time is being spent. If most of the time goes to %iowait rather than %user, then you need to reduce I/O to reduce execution time, and, as the previous post said, threads will not help you in that case.
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.05    0.00    0.10    0.05    0.00   99.80

    Device:  rrqm/s  wrqm/s   r/s   w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
    hda        0.00    5.80  0.20  2.00   0.00   0.03     29.82      0.01   4.45   2.36   0.52

      I ran "iostat -mx 5" and found that %iowait is 0.60, whereas %user is 51.xx. That means I/O is not what is taking the time, so threads can be used and can reduce the execution time. Happy to learn that. Based on my other comments and posts, can you please suggest how I can use threads?

        Did you wait for several measurements from iostat, say over 30s? Utilization may be spiky.

        "%util" is a measurement of how hard your disk drive is working, on a scale from 0-100. What's in that column?

        Pasting the output of a 30s measurement would be enlightening.

Re^2: how to split huge file reading into multiple threads
by sagarika (Novice) on Aug 30, 2011 at 10:19 UTC

    I cannot use things like SQLite. That's how my boss wants it.

    Can you please suggest a system monitor to watch the swap I/O?