I think you will find that the fork method on that box will give you better performance than threading. If you have sole access to that box while this is running, limit your fork to 4 processes; any more and you will see diminishing returns, as the CPUs will need to context-switch between the uncompress processes. If you end up using gzip, you may want to look at a GNU version -- I have seen 20% speedups over the vanilla Sun binary. You may also want to investigate the difference the various compression flags make on your files -- if these are DNA tag files, a small dictionary with low compression may buy you the same compression ratio (or close) with way less CPU time.
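As a rough illustration of capping the work at four forked processes, here is a minimal sketch in Python. The language of the original script isn't shown in this thread, so that choice is an assumption, and `gunzip`, `gunzip_all`, and `WORKERS` are hypothetical names. It uses a fork-based process pool, so it is Unix-only:

```python
import gzip
import multiprocessing
import shutil
from pathlib import Path

WORKERS = 4  # assumption: one worker per CPU on the 4-CPU box discussed above


def gunzip(path):
    """Decompress a .gz file next to itself and return the output path."""
    src = Path(path)
    dst = src.with_suffix("")  # drop the trailing ".gz"
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return str(dst)


def gunzip_all(paths, workers=WORKERS):
    """Decompress all paths using at most `workers` forked processes."""
    ctx = multiprocessing.get_context("fork")  # explicit fork start method
    with ctx.Pool(processes=workers) as pool:
        # map() hands files to whichever worker is free and keeps input order.
        return pool.map(gunzip, list(paths))
```

Calling `gunzip_all(["a.fasta.gz", "b.fasta.gz"])` (hypothetical file names) would write the decompressed files alongside the originals. Raising `workers` above the CPU count is easy to try if you want to see where the returns start to diminish.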
-Waswas
Wow, sounds like you've had some experience in dealing with biology-related flat files. :)
Thanks for the tips. Two specific responses:
- I am definitely rewriting my script right now to use all 4 processors. As for having sole access to the box, I'm the sysadmin and can definitely renice my jobs to take precedence over anyone else's, but I'm at an academic institution and we don't have many people on our machines this late into the summer.
- I am using the GNU version of gzip, and unfortunately all of the databases were originally compressed by the NCBI. I doubt they thought about optimizations when compressing them, as they only recently started using the gzip format, but that's definitely a great tip that I can use and pass around. :)
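For anyone who does control the compression step, a rough way to check the flag trade-off is to time a few gzip levels on a sample of the data. This is a minimal sketch in Python, not anything from the thread: `compare_levels` and the choice of levels 1, 6, and 9 are hypothetical, and it only varies the compression level, not the dictionary size.

```python
import gzip
import time


def compare_levels(data, levels=(1, 6, 9)):
    """Return a list of (level, compressed_size, cpu_seconds) tuples."""
    results = []
    for level in levels:
        start = time.process_time()  # CPU time, not wall-clock time
        packed = gzip.compress(data, compresslevel=level)
        elapsed = time.process_time() - start
        results.append((level, len(packed), elapsed))
    return results
```

Feeding it the first few megabytes of a real database file should show whether a lower level gives up much ratio in exchange for the CPU time it saves.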