Re: Re: Best Practices for Uncompressing/Recompressing Files?

I could definitely install a threaded version of Perl in a non-standard location; that's not a bad idea.

It's running on a Sun Enterprise 450 server with 4 CPUs and 4 gigs of RAM, which makes me think that parallelizing could give me great performance gains. When uncompressing files, the CPU usage is always at 99-100% as viewed by top, so the operation appears to be CPU-bound.

As far as disk space goes, my lab is a bioinformatics lab, and I'm installing more disk space this week. :) Unfortunately, I can't use it as temporary space for this project (I would need about 120 gigs to uncompress all of the files). :(

I'm thinking that I'll just try Parallel::ForkManager. If I can find the time while on vacation next week, I may even write up a tutorial on the subject. Thanks for the tips, everyone. :)
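For anyone following along, here is a minimal sketch of what the Parallel::ForkManager approach might look like. It is not a finished script. It assumes the input files are named on the command line and end in .gz, that Parallel::ForkManager is installed from CPAN, and that the IO::Compress modules are available. The process_line() routine and the .new.gz output naming are placeholders I made up for illustration. Each worker streams its file from gunzip straight into gzip, so the uncompressed data never needs temporary disk space.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;                        # CPAN module, not core
use IO::Uncompress::Gunzip qw($GunzipError);
use IO::Compress::Gzip     qw($GzipError);

my $MAX_WORKERS = 4;    # assumption: one worker per CPU on a 4-CPU box

# Hypothetical per-line transformation; replace with the real work.
sub process_line {
    my ($line) = @_;
    return $line;
}

# Stream one .gz file through process_line() into a new .gz file,
# so the uncompressed data never has to sit on disk.
sub recompress_file {
    my ($in, $out) = @_;
    my $z_in = IO::Uncompress::Gunzip->new($in, MultiStream => 1)
        or die "gunzip $in failed: $GunzipError\n";
    my $z_out = IO::Compress::Gzip->new($out)
        or die "gzip $out failed: $GzipError\n";
    while (defined(my $line = $z_in->getline)) {
        $z_out->print(process_line($line));
    }
    $z_in->close;
    $z_out->close;
    return;
}

my $pm = Parallel::ForkManager->new($MAX_WORKERS);
for my $in (@ARGV) {
    # Placeholder naming scheme: foo.gz becomes foo.new.gz; skip other names.
    (my $out = $in) =~ s/\.gz\z/.new.gz/ or next;
    $pm->start and next;            # parent: go on to the next file
    recompress_file($in, $out);     # child: do the work
    $pm->finish;                    # child exits here
}
$pm->wait_all_children;
```

The start/finish pair is what caps the number of live children: start blocks in the parent once $MAX_WORKERS children are running, and returns again as soon as one of them finishes.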

Re: Re: Re: Best Practices for Uncompressing/Recompressing Files? by waswas-fng (Curate) on Aug 11, 2003 at 03:30 UTC
I think you will find that the fork method on that box will give you better performance than threading. If you have sole access to that box while this is running, limit your forks to 4 processes; any more and you will see diminishing returns, as the CPUs will need to context-switch among the uncompress processes. If you end up using gzip, you may want to look at the GNU version -- I have seen 20% speedups over the vanilla Sun binary. You may also want to investigate the effect the different compression flags have on your files -- if these are DNA tag files, a small dictionary with low compression may buy you the same compression ratio (or close) with far less CPU time. -Waswas
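One rough way to check the compression-flag suggestion is to time a few gzip levels on a sample of the data. The sketch below is only an illustration: it uses the Level option of IO::Compress::Gzip as a stand-in for "low compression" (gzip has no separate dictionary-size flag that I know of), and the random A/C/G/T string is a synthetic placeholder, so real sequence files may behave quite differently.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);
use Time::HiRes qw(time);

# Compress a buffer at the given level; return (compressed bytes, seconds).
sub measure_level {
    my ($data_ref, $level) = @_;
    my $compressed;
    my $start = time;
    gzip($data_ref => \$compressed, Level => $level)
        or die "gzip failed: $GzipError\n";
    return (length($compressed), time - $start);
}

# Synthetic stand-in for a sequence file; substitute a real sample.
srand(42);
my @bases  = qw(A C G T);
my $sample = join '', map { $bases[rand @bases] } 1 .. 200_000;

for my $level (1, 6, 9) {
    my ($bytes, $secs) = measure_level(\$sample, $level);
    printf "level %d: %d -> %d bytes (%.1f%%) in %.3fs\n",
        $level, length($sample), $bytes,
        100 * $bytes / length($sample), $secs;
}
```

If the lower level turns out to give nearly the same size on real files, that is CPU time saved on every recompression pass.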
Re: Re: Re: Re: Best Practices for Uncompressing/Recompressing Files? by Anonymous Monk on Aug 14, 2003 at 07:49 UTC
Wow, sounds like you've had some experience in dealing with biology-related flat files. :) Thanks for the tips. Two specific responses:
- I am definitely rewriting my script right now to use all 4 processors. As far as having sole access to the box, I'm the sysadmin of the box and I can definitely renice my jobs to take precedence over anyone else's stuff, but I'm at an academic institution and we don't have many people on our machines this late into the summer.
- I am using the GNU version of gzip, and unfortunately all of the databases were originally compressed by the NCBI. I doubt they thought of optimizations when compressing them, as they just recently started using gzip format, but that's definitely a great tip that I can use and pass around. :)