bheckel has asked for the wisdom of the Perl Monks concerning the following question:

I'm using Perl to take a text file and run a processor-intensive activity on each line. I'd like to speed up the program by taking advantage of the dual-processor Linux box it runs on.

My plan is to fork a child, then let the parent and child both do the processor-intensive work simultaneously.

My problem is how to orchestrate the distribution of the text file so that the parent grabs a line, the child grabs the next line, and so on to EOF.

I've looked into perlipc but am not sure which strategy will work. I've searched PerlMonks but haven't had any luck finding this problem described.

Thanks for any insight you may have.
Bob <monks.20.bheckel@spamgourmet.com>

Re: Process file simultaneously after forking
by Helter (Chaplain) on Nov 14, 2002 at 16:19 UTC
    The way I handled this for doing work on an 8-way Beowulf cluster was to take the data set and split it into 8 chunks. This seemed easier than trying to deal lines out to each machine in order.

    In your case, either split your file into 2 files before forking so each process deals with a half, or (if possible) have the second process seek to halfway through the file and start there, making sure the first stops just before where the second started.

    This should be much better than trying to synchronize each thread; if you do it that way, you don't gain nearly the performance of having 2 unsynchronized threads. (A lot of time would be spent waiting for the other thread to work.)

    Now that I think of it: have the parent do the odd lines and the child do the even lines. As long as you are not modifying the original file you should be fine (same concept as above).

    If you are going to modify the file, have each thread write its own temporary file; then, after the major processing is done (the child exits), have the parent merge the output based on how you divided the data set. A rough sketch follows.
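    A minimal sketch of that odd/even plan, assuming one output line per input line; input.txt, out.odd, out.even, and process_line() are all made-up names, with the real per-line work going in process_line():

    #!/usr/bin/perl
    use strict;
    use warnings;

    sub process_line { my ($l) = @_; return uc $l }  # trivial stand-in for the real work

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    # parent takes odd lines, child takes even lines; each writes its own file
    my ($want, $out_file) = $pid ? (1, 'out.odd') : (0, 'out.even');

    open my $in,  '<', 'input.txt' or die "open input: $!";
    open my $out, '>', $out_file   or die "open $out_file: $!";
    while (my $line = <$in>) {
        next unless $. % 2 == $want;       # $. is the current line number
        print {$out} process_line($line);
    }
    close $out;

    exit 0 unless $pid;                    # child is done

    waitpid $pid, 0;                       # parent waits, then merges in order
    open my $odd,  '<', 'out.odd'  or die "open out.odd: $!";
    open my $even, '<', 'out.even' or die "open out.even: $!";
    while (defined(my $o = <$odd>)) {
        print $o;
        my $e = <$even>;
        print $e if defined $e;
    }

    Note that each process opens input.txt after the fork, so they have independent filehandles and independent $. counters.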

    If you could describe the nature of your problem, most of my guessing and ideas would go away and the "right" way to do it would become clear.

    Hope this helped!
Re: Process file simultaneously after forking
by perrin (Chancellor) on Nov 14, 2002 at 16:23 UTC
    There are relevant IPC examples in the Perl Cookbook. The approach that comes to mind for me is to simply have a file that contains the next line to process, and use file-locking to coordinate updates to that file between the two processes. You should read in chunks, not single lines, since single lines are processed so quickly that the lock file would become a bottleneck.
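
    For what it's worth, a rough sketch of that locking scheme; names like input.txt, next_offset.txt, and process_line() are invented for illustration, and next_offset.txt must exist and contain 0 before the processes start:

    use strict;
    use warnings;
    use Fcntl qw(:flock :seek);

    sub process_line { }   # stand-in for the real per-line work

    my $chunk = 100;       # lines claimed per lock; large enough that locking is cheap

    # Under an exclusive lock: read the shared offset, seek there, grab up to
    # $chunk lines, and store the new offset for whichever process locks next.
    sub claim_lines {
        my ($data_fh, $offset_file) = @_;
        open my $off, '+<', $offset_file or die "open $offset_file: $!";
        flock $off, LOCK_EX or die "flock: $!";
        my $pos = <$off>;
        $pos = 0 unless defined $pos;
        chomp $pos;
        seek $data_fh, $pos, SEEK_SET or die "seek: $!";
        my @lines;
        while (@lines < $chunk and defined(my $l = <$data_fh>)) {
            push @lines, $l;
        }
        seek $off, 0, SEEK_SET;
        truncate $off, 0;
        print {$off} tell($data_fh), "\n";
        close $off;        # closing the handle releases the lock
        return @lines;
    }

    # parent and child each run this loop with their own filehandle
    open my $data, '<', 'input.txt' or die "open input: $!";
    while (my @work = claim_lines($data, 'next_offset.txt')) {
        process_line($_) for @work;
    }

    Because each process holds the lock only long enough to claim a chunk, neither one sits idle while the other does the expensive work.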
Re: Process file simultaneously after forking
by dingus (Friar) on Nov 14, 2002 at 16:09 UTC
    My problem is how to orchestrate the distribution of the text file so that the parent grabs a line, the child grabs the next line, and so on to EOF.

    I've looked into perlipc but am not sure which strategy will work.

    You don't say how big/complex the text file is. Two ways that spring to mind - and which don't require any IPC - both amount to running a prefilter on the file. You could copy the file and then - depending on whether it is the forker or the forkee - read the odd or even lines. Or you could slurp the entire file into an array and then use some cunning manipulation to extract every other line. Something like

    my (@odd, @even); while (<FILE>) { if ($. & 1) { push @odd, $_ } else { push @even, $_ } }

    Dingus


    Enter any 47-digit prime number to continue.
      Or more succinctly,
      my (@odd, @even); push @{ $. & 1 ? \@odd : \@even }, $_ while <FILE>;

      Makeshifts last the longest.

Re: Process file simultaneously after forking
by iburrell (Chaplain) on Nov 14, 2002 at 17:52 UTC
    It would be simpler for the parent to fork two child processes that do the processing. The parent would just read the input, divide it among the children, and write lines to the children's pipes. The child processes would read lines from stdin and process them. This way all the components are simple, and you don't have to worry about synchronization because the operating system handles it through the pipes. Another advantage is that it scales to any number of children.

    This system does have a starvation problem. The parent will block on writing to the slowest child process while all the other children wait on reading their next line. If this is a problem, you can try using non-blocking IO in the parent. The parent can then select an available child to write the line to.
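
    A hedged sketch of that layout, dealing lines out round-robin; the worker count, input.txt, and process_line() are assumptions, so swap in the real work:

    #!/usr/bin/perl
    use strict;
    use warnings;

    sub process_line { my ($l) = @_; chomp $l; print "[$$] $l\n" }  # stand-in work

    my $workers = 2;
    my @pipes;

    for (1 .. $workers) {
        pipe(my $reader, my $writer) or die "pipe: $!";
        my $pid = fork();
        die "fork: $!" unless defined $pid;
        if ($pid == 0) {                 # child: read lines until the pipe closes
            close $writer;
            close $_ for @pipes;         # drop write ends inherited from earlier workers
            process_line($_) while <$reader>;
            exit 0;
        }
        close $reader;                   # parent keeps only the write end
        push @pipes, $writer;
    }

    open my $in, '<', 'input.txt' or die "open: $!";
    my $n = 0;
    while (my $line = <$in>) {           # deal lines out round-robin
        print { $pipes[$n++ % $workers] } $line;
    }
    close $_ for @pipes;                 # EOF on the pipes lets the children finish
    wait() for 1 .. $workers;            # reap them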

Re: Process file simultaneously after forking
by waswas-fng (Curate) on Nov 14, 2002 at 16:32 UTC
    Fork is not the only option; in Perl 5.8, threads have become a little more stable and may do what you are looking for a bit more easily. Take a look at Thread::Pool.
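
    Thread::Pool itself is on CPAN; as an unofficial sketch of the same shape using only the core threads and Thread::Queue modules (not Thread::Pool's own API), assuming a 5.8 perl built with ithreads, and with input.txt and the print as stand-ins:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    my $queue   = Thread::Queue->new;
    my $workers = 2;

    # each worker pulls lines until it sees the undef terminator
    sub worker {
        while (defined(my $line = $queue->dequeue)) {
            chomp $line;
            print "[tid ", threads->tid, "] $line\n";   # stand-in for the real work
        }
    }

    my @threads = map { threads->create(\&worker) } 1 .. $workers;

    open my $in, '<', 'input.txt' or die "open: $!";
    $queue->enqueue($_) while <$in>;
    $queue->enqueue(undef) for 1 .. $workers;   # one terminator per worker
    $_->join for @threads;

    Because both threads pull from one shared queue, a thread stuck on a slow line doesn't hold up the other.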

    -Waswas
Re: Process file simultaneously after forking
by bheckel (Beadle) on Nov 14, 2002 at 21:26 UTC
    Great advice. Thanks everyone for taking the time to reply.

    What I'm doing with each line (it's an 8,000-line file) involves querying a database, then doing a fuzzy string match of each returned record against my string.

    I've tried the split-file approach, but I end up with almost the same total run time as without forking. I'm guessing it's because I can't divide the file evenly (some strings are more difficult to process), so either the parent or the child finishes while the other is still working away at the harder matches. I need the first one that finishes to keep working. So iburrell's idea of non-blocking IO in the parent might be my next attempt.
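
    For the record, a sketch of that variant, using the same made-up names as the pipe example above; IO::Select picks whichever child's pipe has room, assuming each line fits comfortably in the pipe buffer:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use IO::Handle;
    use IO::Select;

    sub process_line { my ($l) = @_; chomp $l; print "[$$] $l\n" }  # stand-in work

    my $workers = 2;
    my @pipes;
    for (1 .. $workers) {
        pipe(my $reader, my $writer) or die "pipe: $!";
        my $pid = fork();
        die "fork: $!" unless defined $pid;
        if ($pid == 0) {                 # child: read lines until the pipe closes
            close $writer;
            close $_ for @pipes;         # drop write ends inherited from earlier workers
            process_line($_) while <$reader>;
            exit 0;
        }
        close $reader;
        $writer->autoflush(1);           # so select() sees the true pipe state
        push @pipes, $writer;
    }

    # feed each line to whichever child's pipe currently has room, so a child
    # stuck on a hard match simply stops being fed until it catches up
    my $sel = IO::Select->new(@pipes);
    open my $in, '<', 'input.txt' or die "open: $!";
    while (my $line = <$in>) {
        my ($ready) = $sel->can_write;   # blocks until at least one pipe is writable
        print {$ready} $line;
    }
    close $_ for @pipes;
    wait() for 1 .. $workers;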

    Bob