comment on

Here's what I would do.

First, write your program to take a list of filenames on the command-line, process all of them, then exit. This gives you maximum flexibility in calling the script.

Second, install GNU xargs and find if you don't already have them. Linux/BSD will come with these; commercial Unices won't.

Now you have everything you need to parallelize this process. Simply use the -P flag to GNU xargs:

  find . -type f -print0 |xargs -0 -P 4

will start up to 4 copies of your program in parallel, feeding each of them as long a list of files as will work. When one batch finishes, another will be started with another batch.

  find . -type f -print0 |xargs -0 -n 1 -P 6

will start up 6 copies of your program in parallel, processing one file each. When one copy finishes, the next will be started. You can vary this process and experiment by writing the file list to another file, then processing chunks of this. If your filenames don't have spaces in them, you can use simple tools like split, head, and tail to do this; otherwise you'll have to write short Perl scripts to deal with a null-terminated list of files.

I would also consider using pipes and/or Compress::Zlib to minimize disk I/O. If you're decompressing to a temp file, then converting this and writing to another file, then compressing the written file, you're effectively writing the file to disk twice uncompressed, and once compressed. Further, while the blocks should mostly be in your buffer cache so not actually read from disk, the different copies of the file are wasting memory with multiple copies of the same file. If you could turn this into something like:

  gunzip -c <file.gz |converter |gzip -c >newfile.gz
  mv newfile.gz file.gz

you would only write the file to disk once compressed, and never uncompressed. This should save you tons of I/O and buffer cache memory (although, as always, YMMV and you should benchmark to see for sure).

In reply to Re: Best Practices for Uncompressing/Recompressing Files? by sgifford
in thread Best Practices for Uncompressing/Recompressing Files? by biosysadmin

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.