in reply to Best Practices for Uncompressing/Recompressing Files?

Here's what I would do.

First, write your program to take a list of filenames on the command-line, process all of them, then exit. This gives you maximum flexibility in calling the script.
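A minimal driver along those lines might look like the following sketch (the names here, including process_one, are placeholders invented for illustration, not from the original post):

```shell
#!/bin/sh
# Hypothetical driver script: take any number of filenames on the
# command line, process each one, then exit.
process_one() {
    # Stand-in for the real uncompress/convert/recompress step.
    printf 'processing %s\n' "$1"
}

for f in "$@"; do
    process_one "$f"
done
```

Because it accepts many filenames per invocation, the same script works called by hand, from find -exec, or batched through xargs.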

Second, install GNU xargs and find if you don't already have them. Linux and the BSDs ship with them; commercial Unices generally won't.

Now you have everything you need to parallelize this process. Simply use the -P flag to GNU xargs:

  find . -type f -print0 | xargs -0 -P 4 yourprog
will start up to 4 copies of your program (yourprog here is a stand-in for your script's name) in parallel, feeding each one as long a list of files as will fit on a command line. When one batch finishes, another is started with the next batch.
  find . -type f -print0 | xargs -0 -n 1 -P 6 yourprog
will start up to 6 copies of your program in parallel, processing one file each; when one copy finishes, the next is started. You can also experiment by writing the file list out to a file, then processing chunks of it. If your filenames don't contain spaces, simple tools like split, head, and tail can do the chunking; otherwise you'll have to write short Perl scripts that can deal with a null-terminated list of files.

I would also consider using pipes and/or Compress::Zlib to minimize disk I/O. If you decompress to a temp file, convert that into a second file, and then compress the result, you're writing the data to disk twice uncompressed and once compressed. Furthermore, although those blocks should mostly stay in the buffer cache rather than being re-read from disk, the multiple copies of the same file waste cache memory. If you could turn this into something like:

  gunzip -c <file.gz |converter |gzip -c >newfile.gz
  mv newfile.gz file.gz
you would only write the file to disk once compressed, and never uncompressed. This should save you tons of I/O and buffer cache memory (although, as always, YMMV and you should benchmark to see for sure).

Re: Re: Best Practices for Uncompressing/Recompressing Files?
by waswas-fng (Curate) on Aug 11, 2003 at 07:25 UTC
    Just for the record, Solaris's native xargs supports -P and -0. Solaris also comes with find.

    -Waswas

      Huh. I just scanned the manpages for Solaris 8 find and xargs, and they don't mention these options. They also produce errors when I try them from the command line:

      bash-2.04$ uname -a
      SunOS robotron.gpcc.itd.umich.edu 5.8 Generic_108528-18 sun4u sparc SUNW,UltraAX-e2
      bash-2.04$ find . -print0
      find: bad option -print0
      find: path-list predicate-list
      bash-2.04$ find . -print |xargs -P 4 echo
      xargs: illegal option -- P
      xargs: Usage: xargs: [-t] [-p] [-e[eofstr]] [-E eofstr] [-I replstr] [-i[replstr]] [-L #] [-l[#]] [-n # [-x]] [-s size] [cmd [args ...]]
      bash-2.04$ find . -print |xargs -0 echo
      xargs: illegal option -- 0
      xargs: Usage: xargs: [-t] [-p] [-e[eofstr]] [-E eofstr] [-I replstr] [-i[replstr]] [-L #] [-l[#]] [-n # [-x]] [-s size] [cmd [args ...]]
      bash-2.04$ which find
      /usr/bin/find
      bash-2.04$ which xargs
      /usr/bin/xargs

      Perhaps the GNU versions are provided in later versions of Solaris?

        lol, my bad, I forgot that xargs and find are in a bundle I auto-install at the end of my Jumpstart installation scripts. Been doing it so long I forgot I'd done it. ++ sgifford

        -Waswas