I've got a boss/worker design issue that I'd like to get some suggestions on. I have a script that I've built up over the years, now over 6300 lines (yes, it's an unholy beast). The program's job involves assembling a big hash (%dataset), each member of which is a "work unit" of a few dozen KB. %dataset can contain anywhere from 0 (effectively a no-op) up to about 10,000 work units. Work units are farmed out in queue fashion to a set of child worker processes. Each worker processes one work unit at a time (altering it) and returns it to the boss. Although not directly relevant to the problem, I'll go ahead and mention that the processing of a work unit is I/O bound and typically takes less than a minute (though in extreme cases it can take up to 10). I'd estimate the program runs about 100 times a day, often with a dataset consisting of a single work unit.
The boss process currently uses the following algorithm (yes, this is pseudo-code):
    $max_children = 30
    create IPC socket
    $SIG{CHLD} = \&reaper
    assemble %dataset
    $num_children_required = minimum($max_children, num_work_units)
    foreach $num_children (1 .. $num_children_required) {
        fork() a worker
    }
    while ($num_children > 0) {
        listen for children on socket
        get from child on socket: processed work unit (if any)
        if (num_unprocessed_work_units > 0) {
            tell child on socket: process work unit identified by $key
        } else {
            tell child on socket: quit
        }
    }
    sub reaper { $num_children--; }
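In case it helps to see the shape of that loop in real Perl, here's a rough sketch of what I mean -- NOT the actual code. It uses one socketpair per worker instead of a single listening socket, and a line-based protocol that only ships the hash key (the workers already inherit %dataset). The real script sends the whole processed work unit back over the socket, which needs proper serialization rather than a one-line reply; everything below is placeholder names and stand-in data.

    #!/usr/bin/perl
    # Simplified sketch of the boss/worker loop above -- not the real script.
    # One socketpair per worker; only the hash key travels over the wire.
    use strict;
    use warnings;
    use Socket;                 # AF_UNIX, SOCK_STREAM, PF_UNSPEC
    use IO::Handle;             # autoflush()
    use IO::Select;

    my $max_children = 30;
    my %dataset = map { "unit$_" => { status => 'raw' } } 1 .. 100;  # stand-in data
    my @todo    = keys %dataset;

    my $num_children = @todo < $max_children ? scalar @todo : $max_children;
    my $sel = IO::Select->new;
    my %pid_of;                 # parent-side handle => worker pid

    for (1 .. $num_children) {
        socketpair(my $boss_end, my $worker_end, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
            or die "socketpair: $!";
        defined(my $pid = fork) or die "fork: $!";

        if ($pid == 0) {                       # ---- worker ----
            close $boss_end;
            $worker_end->autoflush(1);
            while (defined(my $key = <$worker_end>)) {
                chomp $key;
                last if $key eq 'QUIT';
                # ... the real, I/O-bound processing of $dataset{$key} goes here ...
                print {$worker_end} "DONE $key\n";
            }
            exit 0;
        }

        close $worker_end;                     # ---- boss ----
        $boss_end->autoflush(1);
        $sel->add($boss_end);
        $pid_of{$boss_end} = $pid;
        print {$boss_end} shift(@todo), "\n";  # prime each worker with one unit
    }

    while ($sel->count) {
        for my $fh ($sel->can_read) {
            my $line = <$fh>;
            if (!defined $line) {              # worker went away unexpectedly
                $sel->remove($fh);
                close $fh;
                next;
            }
            chomp $line;
            if ($line =~ /^DONE (\S+)/) {
                $dataset{$1}{status} = 'processed';   # merge result back in
            }
            if (@todo) {
                print {$fh} shift(@todo), "\n";       # hand out the next unit
            }
            else {
                print {$fh} "QUIT\n";                 # nothing left: dismiss it
                $sel->remove($fh);
                close $fh;
            }
        }
    }
    waitpid($_, 0) for values %pid_of;                # reap the workers

The per-worker socketpair is just to keep the sketch short; the real script's single IPC socket plus SIGCHLD reaper amount to the same bookkeeping.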
$max_children was initially 50, but in recent months I've had to reduce it to 30 as a workaround for the problem I've started experiencing. That problem is memory -- or rather, a lack of it, caused by growth in the number of work units. Because %dataset is assembled before the worker processes are forked off, each worker inherits a complete copy of the entire dataset. As the boss collects processed work units, it updates %dataset accordingly, which triggers copy-on-write for the parts of %dataset being updated.
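For anyone who wants to see the effect in numbers: on Linux (and only on Linux -- this isn't part of the program), summing Private_Dirty from each worker's /proc/<pid>/smaps shows how many pages copy-on-write has already duplicated, i.e. how much of a worker's copy of %dataset is no longer shared with the boss:

    #!/usr/bin/perl
    # Rough Linux-only helper: pass worker PIDs on the command line and it
    # totals Private_Dirty (pages no longer shared with the parent) per PID.
    use strict;
    use warnings;

    for my $pid (@ARGV) {
        open my $smaps, '<', "/proc/$pid/smaps"
            or do { warn "cannot read smaps for $pid: $!"; next };
        my $private_kb = 0;
        while (<$smaps>) {
            $private_kb += $1 if /^Private_Dirty:\s+(\d+) kB/;
        }
        close $smaps;
        printf "pid %6d: %8d kB un-shared\n", $pid, $private_kb;
    }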
The obvious solution is to create all the worker processes BEFORE assembling %dataset; however, it's only after %dataset is assembled that the boss knows how many worker processes it needs, so I have sort of a chicken-and-egg problem.
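(For context: the processed units already come back to the boss over the socket, and if the workers stopped inheriting %dataset, the unprocessed units would have to go out the same way. Something like the following is roughly what moving a few-dozen-KB unit across the socket involves -- Storable plus a length prefix; send_unit()/recv_unit() are made-up names, not helpers from the script.)

    use strict;
    use warnings;
    use Storable qw(nfreeze thaw);

    # Caller is expected to binmode() the socket first.
    sub send_unit {
        my ($fh, $key, $unit) = @_;
        my $frame = nfreeze([ $key, $unit ]);
        print {$fh} pack('N', length $frame), $frame;   # 4-byte length prefix
    }

    sub recv_unit {
        my ($fh) = @_;
        read($fh, my $len_buf, 4) == 4 or return;       # EOF
        my $len = unpack 'N', $len_buf;
        read($fh, my $frame, $len) == $len or return;
        return @{ thaw($frame) };                       # ($key, $unit)
    }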
So far, I've come up with two solutions, neither of which gives me the warm fuzzies.
Does anybody see a solution that's better than either of those in terms of SIMPLICITY and EFFICIENCY -- preferably one that doesn't require a major overhaul?