in reply to Best Practices for Uncompressing/Recompressing Files?

You can always install a thread-enabled version of Perl in a non-default location just for this script. Or you can fork, as stated above. The question is what platform this runs on and how many CPUs it has. If it is decompressing and compressing, there may be a CPU bottleneck to begin with (hence it is slow), so threading or forking may actually make performance worse if the threads are fighting for time on the same CPU; forced context switches can play not-so-fun games with your speed. If you have enough CPUs on the box, then fork or thread; if not, consider taking dws's advice and making a business case for more disk space.
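
For the thread route, a rough, untested sketch (the *.gz glob and the worker count are just placeholders) might look like this:

    #!/usr/bin/perl
    # Rough sketch only -- assumes a threads-enabled perl and gzip in the PATH.
    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    my $workers = 4;                      # match the number of CPUs
    my $queue   = Thread::Queue->new;
    $queue->enqueue(glob "*.gz");         # files to process (placeholder glob)
    $queue->enqueue((undef) x $workers);  # one end-of-work marker per thread

    my @threads = map {
        threads->create(sub {
            while (defined(my $file = $queue->dequeue)) {
                # replace with the real uncompress/process/recompress step
                system("gzip", "-d", $file) == 0
                    or warn "gzip -d failed on $file: $?";
            }
        });
    } 1 .. $workers;

    $_->join for @threads;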

-Waswas

Re: Re: Best Practices for Uncompressing/Recompressing Files?
by biosysadmin (Deacon) on Aug 11, 2003 at 00:39 UTC
    I definitely could install a threaded version of Perl in a non-standard location, not a bad idea.

    It's running on a Sun Enterprise 450 server with 4 CPUs and 4 GB of RAM, which makes me think that parallelizing could give me great performance gains. When uncompressing files, the CPU usage is always at 99-100% as viewed in top, so the operation appears to be CPU-limited.

    As far as disk space goes, my lab is a bioinformatics lab, and I'm installing more disk space this week. :) Unfortunately, I can't use it as temporary space for this project (I would need about 120 GB to uncompress all of the files). :(

    I'm thinking that I'll just try Parallel::ForkManager. If I can find the time while on vacation next week, I may even write up a tutorial on the subject. Thanks for the tips, everyone. :)
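
    In case it helps anyone else, here is the rough, untested shape of the Parallel::ForkManager loop I have in mind (the glob pattern is just a placeholder for the real database files):

        #!/usr/bin/perl
        # Untested outline -- four workers, one per CPU on the E450.
        use strict;
        use warnings;
        use Parallel::ForkManager;

        my $pm = Parallel::ForkManager->new(4);   # cap at the number of CPUs

        for my $file (glob "/data/db/*.gz") {     # placeholder path
            $pm->start and next;                  # parent: fork a child, move on
            # child process: uncompress, update, recompress
            system("gzip", "-d", $file) == 0
                or warn "gzip -d failed on $file: $?";
            $pm->finish;                          # child exits here
        }
        $pm->wait_all_children;
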
      I think you will find that the fork method on that box will give you better performance than threading. If you have sole access to the box while this is running, limit your forks to 4 processes; any more and you will see diminishing returns as the CPUs will need to context-switch between the uncompress processes. If you end up using gzip, you may want to look at a GNU build -- I have seen 20% speedups over the stock Sun binary. You may also want to investigate the effect the different compression flags have on your files -- if these are DNA tag files, a small dictionary with low compression may buy you the same compression ratio (or close to it) with far less CPU time.
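
      To get a feel for what the compression levels cost on your data, something quick and dirty like this (untested, and 'sample.dat' stands in for one of your real files) would show the time/size trade-off:

          #!/usr/bin/perl
          # Time gzip at a few compression levels against one representative file.
          use strict;
          use warnings;
          use Time::HiRes qw(time);

          my $sample = 'sample.dat';      # placeholder -- use a real tag file

          for my $level (1, 6, 9) {
              my $out   = "$sample.$level.gz";
              my $start = time;
              system("gzip -c -$level $sample > $out") == 0
                  or warn "gzip -$level failed: $?";
              printf "level %d: %.1f s, %d bytes\n",
                  $level, time - $start, -s $out;
          }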

      -Waswas
        Wow, sounds like you've had some experience in dealing with biology-related flat files. :)

        Thanks for the tips. Two specific responses:

        - I am definitely rewriting my script right now to use all 4 processors. As far as having sole access to the box, I'm the sysad of the box and I can definitely renice my jobs to take precedence over anyone else's stuff, but I'm at an academic institution and we don't have many people on our machines this late into the summer.

        - I am using the GNU version of gzip, and unfortunately all of the databases were originally compressed by the NCBI. I doubt they thought of optimizations when compressing them as they just recently started using gzip format, but that's definitely a great tip that I can use and pass around. :)