That'll depend to a large extent on your filesystem, and whether your bottleneck is in the IO of reading the files or in the parsing once lines are in memory.

A bunch of threads all trying to read different files at the same time could possibly slow things down, if the disks have to keep jumping around from file to file to serve the different threads in turn. In general, it's probably faster to ask a disk for file1 and then file2 than to ask for both simultaneously. Filesystems have gotten pretty smart about such things, but ultimately the hardware can only do one thing at a time.

If you have multiple disks, perhaps in RAID or a mirroring situation, then it may be possible to read more than one file at a time, and you could gain something. And if your parsing is complicated enough that you can parse a chunk from file1 while a chunk from file2 is being found and read, you could gain a lot.
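If you want to experiment with that last idea, one way to overlap the IO with the parsing is to keep all the reading in the main thread and hand lines to a pool of parser threads. This is only a rough sketch under assumptions of my own: the file names, the worker count, and the tab-delimited key/value format are placeholders, and the Thread::Queue hand-off is just one way to arrange it.

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    my @files   = ('file1.txt', 'file2.txt');    # placeholder file names
    my $queue   = Thread::Queue->new();
    my $workers = 4;                             # placeholder worker count

    # Parser threads: pull lines off the queue and do the CPU-bound work.
    my @parsers = map {
        threads->create(sub {
            my %partial;
            while (defined(my $line = $queue->dequeue())) {
                my ($key, $value) = split /\t/, $line, 2;   # stand-in for the real parsing
                $partial{$key} = $value;
            }
            return \%partial;
        });
    } 1 .. $workers;

    # The main thread does all the reading, so the disk sees one sequential reader.
    for my $file (@files) {
        open my $fh, '<', $file or die "Can't open '$file': $!";
        while (my $line = <$fh>) {
            chomp $line;
            $queue->enqueue($line);
        }
        close $fh;
    }
    $queue->enqueue(undef) for 1 .. $workers;    # one end-of-work marker per parser

    # Merge the per-thread hashes into one.
    my %result;
    for my $thr (@parsers) {
        my $partial = $thr->join();
        @result{ keys %$partial } = values %$partial;
    }

In practice you'd enqueue lines in batches rather than one at a time, since every enqueue and dequeue takes a lock.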
Of course, the only way to find out for sure will be to try it.
Aaron B.
Available for small or large Perl jobs; see my home node.
When I get back to work on Monday, I plan on updating the actual script. I'm dealing with files that have 30-45 million rows. I'm hoping that spawning 10+ threads (on a machine with 15+ CPUs) dedicated solely to parsing the files will reduce the runtime, rather than working through the files sequentially, one by one.
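Roughly, what I have in mind is one thread per file, along the lines of the sketch below; the file list and the split on tabs are just placeholders for the real parsing.

    use strict;
    use warnings;
    use threads;

    my @files = glob '*.dat';    # placeholder for the real file list

    # One thread per file; each thread parses its own file into a private hash.
    my @threads = map {
        my $file = $_;
        threads->create(sub {
            my %partial;
            open my $fh, '<', $file or die "Can't open '$file': $!";
            while (my $line = <$fh>) {
                chomp $line;
                my ($key, $value) = split /\t/, $line, 2;   # stand-in for the real parsing
                $partial{$key} = $value;
            }
            close $fh;
            return \%partial;
        });
    } @files;

    # Join the threads and merge their hashes into one.
    my %combined;
    for my $thr (@threads) {
        my $partial = $thr->join();
        @combined{ keys %$partial } = values %$partial;
    }

Each thread builds a private hash, and join() hands back a copy of it to the parent for the merge at the end.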
Here's my prediction: Even modified, your multi-threaded code will run significantly more slowly (by an order of magnitude) than your single-threaded code.
And my solution: Iff you supplied me with the information I requested, I could process those same files into a hash in 1/10th of the time your single-threaded code takes.
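To give a sense of where that kind of speedup typically comes from: stop paying readline and per-line overhead tens of millions of times, and instead read the file in large fixed-size blocks and split the lines in memory. The sketch below only illustrates that idea and is not the code being offered; the file name, the block size, and the tab-delimited key/value format are all assumptions, and the real answer depends on the information requested above.

    use strict;
    use warnings;

    my $file = 'big.dat';    # placeholder file name
    my %hash;

    open my $fh, '<', $file or die "Can't open '$file': $!";
    my $buffer = '';
    my $chunk;
    while (read $fh, $chunk, 8 * 1024 * 1024) {          # 8 MB blocks
        $chunk = $buffer . $chunk;
        my $last_nl = rindex $chunk, "\n";
        if ($last_nl < 0) {                              # no complete line yet
            $buffer = $chunk;
            next;
        }
        $buffer = substr $chunk, $last_nl + 1;           # keep the partial final line
        for my $line (split /\n/, substr $chunk, 0, $last_nl) {
            my ($key, $value) = split /\t/, $line, 2;    # stand-in for the real parsing
            $hash{$key} = $value;
        }
    }
    if (length $buffer) {                                # file may not end with a newline
        my ($key, $value) = split /\t/, $buffer, 2;
        $hash{$key} = $value;
    }
    close $fh;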
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".