Dear monks,

I have a process pipeline that consists of the following steps:

1. Convert data to TSV (using a Perl script)
2. Launch the database bulk loader to import the TSV into the DB

I have a machine that has 4 CPUs, and I want to maximize CPU utilization because the data takes a while to convert. Step 1 can easily be parallelized, but step 2 must be serialized, as the database bulk loader fails if another bulk loading process is currently loading into the same table. I also want to keep disk space usage low, so I want to import each data file as soon as it is written to disk instead of importing all the data after converting it. So I need a semaphore on step 2.

I have written a set of shell scripts that nicely do this, using the runN utility by Dominus for convenient parallelization:

BASE=$(cd $(dirname $0); pwd)
echo $DAYS
THIS=$(basename $0)
DB_SEMAPHORE=/tmp/$THIS.$$.import
rm -f $DB_SEMAPHORE
touch $DB_SEMAPHORE
export DB_SEMAPHORE
echo "Launching reader in $DB_SEMAPHORE"
(cd ..; tail -f $DB_SEMAPHORE | xargs -i ./load.sh {} >>$BASE/import.log )&

# ./convert.sh echoes the appropriate parameters into $DB_SEMAPHORE
../runN -n 4 ./convert.sh $DAYS

# Signal EOF to xargs
echo "_" >> $DB_SEMAPHORE
wait
echo "Import done"
rm $DB_SEMAPHORE

There are lots of ugly parts to this shell script, but it works. The ugly parts are:

In (cd ..; tail -f $DB_SEMAPHORE | xargs -i ./load.sh {} >>$BASE/import.log )&, the tail -f process stays alive even after the import is done, at least when I prematurely cancel the shell script.
The parallelization logic in ../runN -n 4 ./convert.sh $DAYS is topologically separated from the parameter-passing logic between the converter and the importer. There is too much action at a distance happening here.
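
For the first point, one possible way to stay in shell and still avoid the lingering tail -f might be a named pipe instead of a regular file. The following is only a minimal sketch under assumptions: convert_one and load_one are hypothetical stand-ins for ./convert.sh and ./load.sh, the day names are made up, and all converters are started at once rather than capped at four as runN does. The parent holds one write end of the pipe open, so the reader only sees EOF when the parent closes it or dies.

```shell
#!/bin/sh
# Sketch only: parallel converters feed one serial loader through a FIFO.
# convert_one and load_one are hypothetical stand-ins for ./convert.sh
# and ./load.sh; the default DAYS list is made up for illustration.

DAYS=${DAYS:-"day1 day2 day3 day4 day5"}
WORKDIR=$(mktemp -d)
QUEUE=$WORKDIR/queue            # named pipe replacing the tail -f file
LOG=$WORKDIR/import.log
mkfifo "$QUEUE"
trap 'rm -rf "$WORKDIR"' EXIT   # remove the pipe and log when the script exits

convert_one() { echo "$1.tsv"; }      # prints the name of the file it "wrote"
load_one()    { echo "loaded $1"; }   # the real one would run the bulk loader

# The single reader: loads one file at a time and exits at EOF on the FIFO.
while read -r file; do
    load_one "$file"
done <"$QUEUE" >>"$LOG" &
loader=$!

# Hold one write end open so the reader does not see EOF between converters.
exec 3>"$QUEUE"

# Run the converters in parallel; each writes one short line to the queue.
pids=
for day in $DAYS; do
    convert_one "$day" >&3 &
    pids="$pids $!"
done
[ -n "$pids" ] && wait $pids    # wait for the converters only, not the loader

exec 3>&-                       # close the last write end: the reader gets EOF
wait "$loader"
echo "Import done"
```

Because the loader is the only process reading from the pipe, the imports are serialized without a separate lock. If the script is cancelled, the kernel closes descriptor 3 and the reader loop ends once the remaining converters have exited, so nothing needs to be killed by hand. This does not address the second point: the converter and the loader still exchange plain lines of text.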

What I'd like is an easy way to write the whole parallelize-then-serialize stuff in Perl. Simple forking doesn't work because I need to pass the data "back up" to the parent process, or downwards in a serial fashion, so that only one DB import runs at a time, preferably still without blocking the overall progress, so that all 4 CPUs keep running. Also, of course, it would be much nicer to pass around Perl data structures instead of having to manually make sure that the number of columns in the converter script is identical to the number of columns expected by the importer script.

I envision as an imaginary API something like the following:

use strict;
use Magic::Parallel max_parallel => 4;

my $parallel_handle = parallel sub {
    my ($self,$payload) = @_;
    system("convert.sh $payload") == 0
        or warn "Couldn't launch: $!/$?";
}, @ARGV;

$parallel_handle->serial(sub {
    my ($self,$payload) = @_;
    system("load_db.sh $payload") == 0
        or warn "Couldn't launch: $!/$?";
});

In practice, the next step would be to eliminate the wrapping shell scripts and to replace them by the real Perl code.

Has anybody done something like this? Is there anything that shields me from serializing the data and then deserializing it like I'd have to do with Parallel::ForkManager?

Update: Just after posting (not after previewing) this, I realize that this would be a prime application for threads, at least under Windows. The target machine runs HP-UX, but at least it's an ActiveState build so threads should be available there too. Is writing a smallish wrapper around threads and Thread::Queue the way to go then?

In reply to Converting a parallel-serial shell script by Corion
