
When I get back to work on Monday, I plan to update the actual script. I'm dealing with files that have 30-45 million rows. I'm hoping that spawning 10+ threads (on a machine with 15+ CPUs) that do nothing but parse the files will reduce the runtime, compared with working through the files sequentially, one by one.

Re^8: Sharing Hash Question
by aaron_baugher (Curate) on Jul 07, 2012 at 14:07 UTC

    That'll depend to a large extent on your filesystem, and whether your bottleneck is in the IO of reading the files or in the parsing once lines are in memory. A bunch of threads all trying to read different files at the same time could possibly slow things down, if the disks have to keep jumping around from file to file to serve the different threads in turn. In general, it's probably faster to ask a disk for file1 and then file2 than to ask for both simultaneously. Filesystems have gotten pretty smart about such things, but ultimately the hardware can only do one thing at a time. If you have multiple disks, perhaps in RAID or a mirroring situation, then it may be possible to read more than one file at a time, and you could gain something. And if your parsing is complicated enough that you can parse a chunk from file1 while a chunk from file2 is being found and read, you could gain a lot.

    Of course, the only way to find out for sure will be to try it.
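
    One rough way to try it before touching any threading code is to time a plain read of a file against a read plus parse of the same file. The following is only a sketch: the record format was never posted in this thread, so the tab-separated key/value layout, the parse_line routine, and the %seen hash are all made-up stand-ins for whatever the real script does.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Make one pass over a file and return (line count, elapsed seconds).
# With no callback this measures the raw cost of reading lines;
# with a callback it measures reading plus the per-line work.
sub time_pass {
    my ($path, $per_line) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $start = time;
    my $lines = 0;
    while (my $line = <$fh>) {
        $per_line->($line) if $per_line;
        $lines++;
    }
    close $fh;
    return ($lines, time - $start);
}

# Hypothetical parser: assumes tab-separated "key<TAB>value" records.
# Substitute the real parsing logic here.
my %seen;
sub parse_line {
    my ($line) = @_;
    chomp $line;
    my ($key, $value) = split /\t/, $line, 2;
    $seen{$key} = $value if defined $key;
}

# Report both timings for every file named on the command line.
for my $path (@ARGV) {
    my ($n, $read_secs)   = time_pass($path);
    my (undef, $all_secs) = time_pass($path, \&parse_line);
    printf "%s: %d lines, read %.2fs, read+parse %.2fs\n",
        $path, $n, $read_secs, $all_secs;
}
```

    If the two numbers are close, the time is going into I/O and extra parsing threads are unlikely to buy much. If read+parse is several times larger, there is parsing work that could overlap with reads. Bear in mind that the second pass may be served from the OS cache, so compare across several files, or on a cold cache, before trusting the ratio.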

    Aaron B.
    Available for small or large Perl jobs; see my home node.

Re^8: Sharing Hash Question
by BrowserUk (Patriarch) on Jul 10, 2012 at 02:20 UTC
    When I get back to work on Monday, I plan to update the actual script. I'm dealing with files that have 30-45 million rows. I'm hoping that spawning 10+ threads (on a machine with 15+ CPUs) that do nothing but parse the files will reduce the runtime, compared with working through the files sequentially, one by one.

    Here's my prediction: Even modified, your multi-threaded code ran significantly (an order of magnitude) more slowly than your single-threaded code.

    And my solution: Iff you supplied me with the information I requested, I could process those same files to a hash in 1/10th the time that your single-threaded code does.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
