bheckel has asked for the wisdom of the Perl Monks concerning the following question:

I'm using Perl to take a text file and run a processor-intensive activity on each line. I'd like to speed up the program by taking advantage of the dual-processor Linux box it runs on.

My plan is to fork a child, then let the parent and child both do the processor-intensive work simultaneously.

My problem is how to orchestrate the distribution of the text file so that the parent grabs a line, the child grabs the next line, and so on to EOF.

I've looked into perlipc but am not sure which strategy will work. I've searched PerlMonks but haven't had any luck finding this problem described.

Thanks for any insight you may have.
Bob <monks.20.bheckel@spamgourmet.com>

Re: Process file simultaneously after forking
by Helter (Chaplain) on Nov 14, 2002 at 16:19 UTC
    The way I handled this for doing work on an 8-way Beowulf cluster was to take the data set and split it into 8 chunks. This seemed easier than trying to deal lines out to each machine in order.

    In your case, either split your file into 2 files before forking so each process deals with a half, or (if possible) have the second process seek to halfway through the file and start there, making sure the first stops just before where the second started.

    This should be much better than trying to synchronize each thread; if you do it that way, you don't gain nearly the performance of having 2 unsynchronized threads. (A lot of time would be spent waiting for the other thread to work.)

    Now that I think of it: have the parent do the odd lines and the child do the even lines. As long as you are not modifying the original file you should be fine (same concept as above).

    If you are going to modify the file, have each thread write its own temporary file; then, after the major processing is done (the child exits), have the parent merge the output based on how you divided the data set. A rough sketch follows.
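    A minimal sketch of that odd/even plan, assuming one output line per input line; input.txt, out.odd, out.even, and process_line() are all made-up names, with the real per-line work going in process_line():

    #!/usr/bin/perl
    use strict;
    use warnings;

    sub process_line { my ($l) = @_; return uc $l }  # trivial stand-in for the real work

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    # parent takes odd lines, child takes even lines; each writes its own file
    my ($want, $out_file) = $pid ? (1, 'out.odd') : (0, 'out.even');

    open my $in,  '<', 'input.txt' or die "open input: $!";
    open my $out, '>', $out_file   or die "open $out_file: $!";
    while (my $line = <$in>) {
        next unless $. % 2 == $want;       # $. is the current line number
        print {$out} process_line($line);
    }
    close $out;

    exit 0 unless $pid;                    # child is done

    waitpid $pid, 0;                       # parent waits, then merges in order
    open my $odd,  '<', 'out.odd'  or die "open out.odd: $!";
    open my $even, '<', 'out.even' or die "open out.even: $!";
    while (defined(my $o = <$odd>)) {
        print $o;
        my $e = <$even>;
        print $e if defined $e;
    }

    Note that each process opens input.txt after the fork, so they have independent filehandles and independent $. counters.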

    If you could describe the nature of your problem, most of my guessing and ideas would go away and the "right" way to do it would become clear.

    Hope this helped!
Re: Process file simultaneously after forking
by perrin (Chancellor) on Nov 14, 2002 at 16:23 UTC
    There are relevant IPC examples in the Perl Cookbook. The approach that comes to mind for me is to simply have a file that contains the next line to process, and use file-locking to coordinate updates to that file between the two processes. You should read in chunks, not single lines, since single lines are processed so quickly that the lock file would become a bottleneck.
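
    For what it's worth, a rough sketch of that locking scheme; names like input.txt, next_offset.txt, and process_line() are invented for illustration, and next_offset.txt must exist and contain 0 before the processes start:

    use strict;
    use warnings;
    use Fcntl qw(:flock :seek);

    sub process_line { }   # stand-in for the real per-line work

    my $chunk = 100;       # lines claimed per lock; large enough that locking is cheap

    # Under an exclusive lock: read the shared offset, seek there, grab up to
    # $chunk lines, and store the new offset for whichever process locks next.
    sub claim_lines {
        my ($data_fh, $offset_file) = @_;
        open my $off, '+<', $offset_file or die "open $offset_file: $!";
        flock $off, LOCK_EX or die "flock: $!";
        my $pos = <$off>;
        $pos = 0 unless defined $pos;
        chomp $pos;
        seek $data_fh, $pos, SEEK_SET or die "seek: $!";
        my @lines;
        while (@lines < $chunk and defined(my $l = <$data_fh>)) {
            push @lines, $l;
        }
        seek $off, 0, SEEK_SET;
        truncate $off, 0;
        print {$off} tell($data_fh), "\n";
        close $off;        # closing the handle releases the lock
        return @lines;
    }

    # parent and child each run this loop with their own filehandle
    open my $data, '<', 'input.txt' or die "open input: $!";
    while (my @work = claim_lines($data, 'next_offset.txt')) {
        process_line($_) for @work;
    }

    Because each process holds the lock only long enough to claim a chunk, neither one sits idle while the other does the expensive work.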
Re: Process file simultaneously after forking
by dingus (Friar) on Nov 14, 2002 at 16:09 UTC
    My problem is how to orchestrate the distribution of the text file so that the parent grabs a line, the child grabs the next line, and so on to EOF.

    I've looked into perlipc but am not sure which strategy will work.

    You don't say how big/complex the text file is. Two ways that spring to mind - and which don't require any IPC - both amount to running a prefilter on the file. You could copy the file and then - depending on whether it is the forker or the forkee - read the odd or even lines. Or you could slurp the entire file into an array and then use some cunning manipulation to extract every other line. Something like

    my (@odd, @even); while (<FILE>) { if ($. & 1) { push @odd, $_ } else { push @even, $_ } }

    Dingus


    Enter any 47-digit prime number to continue.
      Or more succinctly,
      my (@odd, @even); push @{ $. & 1 ? \@odd : \@even }, $_ while <FILE>;

      Makeshifts last the longest.

Re: Process file simultaneously after forking
by iburrell (Chaplain) on Nov 14, 2002 at 17:52 UTC
    It would be simpler for the parent to fork two child processes that do the processing. The parent would just read the input, divide it among the children, and write lines to the children's pipes. The child processes would read lines from stdin and process them. This way all the components are simple, and you don't have to worry about synchronization because the operating system handles it through the pipes. Another advantage is that it scales to any number of children.

    This system does have a starvation problem. The parent will block on writing to the slowest child process while all the other children wait on reading their next line. If this is a problem, you can try using non-blocking IO in the parent. The parent can then select an available child to write the line to.
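
    A hedged sketch of that layout, dealing lines out round-robin; the worker count, input.txt, and process_line() are assumptions, so swap in the real work:

    #!/usr/bin/perl
    use strict;
    use warnings;

    sub process_line { my ($l) = @_; chomp $l; print "[$$] $l\n" }  # stand-in work

    my $workers = 2;
    my @pipes;

    for (1 .. $workers) {
        pipe(my $reader, my $writer) or die "pipe: $!";
        my $pid = fork();
        die "fork: $!" unless defined $pid;
        if ($pid == 0) {                 # child: read lines until the pipe closes
            close $writer;
            close $_ for @pipes;         # drop write ends inherited from earlier workers
            process_line($_) while <$reader>;
            exit 0;
        }
        close $reader;                   # parent keeps only the write end
        push @pipes, $writer;
    }

    open my $in, '<', 'input.txt' or die "open: $!";
    my $n = 0;
    while (my $line = <$in>) {           # deal lines out round-robin
        print { $pipes[$n++ % $workers] } $line;
    }
    close $_ for @pipes;                 # EOF on the pipes lets the children finish
    wait() for 1 .. $workers;            # reap them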

Re: Process file simultaneously after forking
by waswas-fng (Curate) on Nov 14, 2002 at 16:32 UTC
    Fork is not the only option; in Perl 5.8, threads have become a little more stable and may do what you are looking for a bit more easily. Take a look at Thread::Pool.
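
    Thread::Pool itself is on CPAN; as an unofficial sketch of the same shape using only the core threads and Thread::Queue modules (not Thread::Pool's own API), assuming a 5.8 perl built with ithreads, and with input.txt and the print as stand-ins:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    my $queue   = Thread::Queue->new;
    my $workers = 2;

    # each worker pulls lines until it sees the undef terminator
    sub worker {
        while (defined(my $line = $queue->dequeue)) {
            chomp $line;
            print "[tid ", threads->tid, "] $line\n";   # stand-in for the real work
        }
    }

    my @threads = map { threads->create(\&worker) } 1 .. $workers;

    open my $in, '<', 'input.txt' or die "open: $!";
    $queue->enqueue($_) while <$in>;
    $queue->enqueue(undef) for 1 .. $workers;   # one terminator per worker
    $_->join for @threads;

    Because both threads pull from one shared queue, a thread stuck on a slow line doesn't hold up the other.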

    -Waswas
Re: Process file simultaneously after forking
by bheckel (Beadle) on Nov 14, 2002 at 21:26 UTC
    Great advice. Thanks everyone for taking the time to reply.

    What I'm doing with each line (it's an 8,000-line file) involves querying a database, then doing a fuzzy string match of each returned record against my string.

    I've tried the split-file approach, but I end up with almost the same total run time as without forking. I'm guessing it's because I can't divide the file evenly (some strings are more difficult to process), so either the parent or the child finishes while the other is still working away at the harder matches. I need the first one that finishes to keep working. So iburrell's idea of non-blocking IO in the parent might be my next attempt.
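
    For the record, a sketch of that variant, using the same made-up names as the pipe example above; IO::Select picks whichever child's pipe has room, assuming each line fits comfortably in the pipe buffer:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use IO::Handle;
    use IO::Select;

    sub process_line { my ($l) = @_; chomp $l; print "[$$] $l\n" }  # stand-in work

    my $workers = 2;
    my @pipes;
    for (1 .. $workers) {
        pipe(my $reader, my $writer) or die "pipe: $!";
        my $pid = fork();
        die "fork: $!" unless defined $pid;
        if ($pid == 0) {                 # child: read lines until the pipe closes
            close $writer;
            close $_ for @pipes;         # drop write ends inherited from earlier workers
            process_line($_) while <$reader>;
            exit 0;
        }
        close $reader;
        $writer->autoflush(1);           # so select() sees the true pipe state
        push @pipes, $writer;
    }

    # feed each line to whichever child's pipe currently has room, so a child
    # stuck on a hard match simply stops being fed until it catches up
    my $sel = IO::Select->new(@pipes);
    open my $in, '<', 'input.txt' or die "open: $!";
    while (my $line = <$in>) {
        my ($ready) = $sel->can_write;   # blocks until at least one pipe is writable
        print {$ready} $line;
    }
    close $_ for @pipes;
    wait() for 1 .. $workers;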

    Bob