msergeant has asked for the wisdom of the Perl Monks concerning the following question:

I am currently reading in a file which can be 1 to 50,000 lines long. I'd like to process this file in parallel, ideally something along these lines:

8 threads or children (if the number of lines > 8).
All 8 split the file intelligently, so that process 1 works on the first 8th of the file, process 2 on the second 8th, and so on.
Once all 8 processes are finished, the output they generate (field1,field2) is dropped into a CSV file in the order in which field1 occurred.

So my question is: how feasible is this, and does anyone have any sample code for doing the threading? The rest I can pretty much work out, but I am wandering into the unknown when trying to do parallel processing.


Cheers,

Mark

Re: Forking processes / children / threads
by pg (Canon) on Nov 19, 2002 at 04:39 UTC
    Based on your description, I wonder whether you really need multiple processes or multiple threads for what you want to accomplish. Unless you have multiple processors, there is no need in this case, as I don't see any operation blocking the others.

    Use multithreading only when you need it. A multithreaded application usually takes more resources.

    Anyway, in case there are reasons you didn't mention in your post that make multithreading or multiprocessing a must, then threads would be a better choice than processes. It is a matter of resources and performance, especially as Perl's multithreading grows more and more mature.
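
    For instance, here is a minimal ithreads sketch of the split-and-join idea (my own illustration only, not tested against your data; it assumes a threads-enabled perl, and process_line() is a hypothetical stand-in for your real per-line work):

      #!/usr/bin/perl
      # Farm the lines of a file out to up to 8 threads and print the
      # results in the original order.  Assumes a threads-enabled perl.
      use strict;
      use warnings;
      use threads;

      open my $fh, '<', 'input.txt' or die "input.txt: $!";
      my @lines = <$fh>;
      close $fh;

      my $n   = @lines > 8 ? 8 : 1;    # 8 workers only if more than 8 lines
      my $per = int(@lines / $n) + 1;  # lines per worker

      my @workers;
      for my $i (0 .. $n - 1) {
          my $start = $i * $per;
          last if $start > $#lines;
          my $end = $start + $per - 1;
          $end = $#lines if $end > $#lines;
          push @workers, threads->create(
              sub { map { process_line($_) } @_ },   # each thread maps its slice
              @lines[$start .. $end]
          );
      }

      # Joining in creation order keeps the output in file order.
      print $_->join() for @workers;

      sub process_line {
          my $line = shift;
          chomp $line;
          return "$line,done\n";   # stand-in for your field1,field2 output
      }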

      The machines I would be running this on are all dual-processor (Sun or Intel). The main thing is that the actual overhead of my process (opening a socket to a local HTTP server) is very low, yet running a 50k-line file in one serial process takes twice as long as splitting the file in two and running two processes. I want to run this script via cron and have it automagically do everything I currently do by hand: splitting the file into 8 pieces, running 8 copies of my current script, then concatenating their output. Unfortunately I can't try 5.8, as it has a few bugs with some of our other code that works fine under 5.6.1.
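
      Since 5.8's threads are out for now, roughly what I'm picturing is a fork()-based version of that split/run/concatenate cycle, along these lines (a rough sketch only; the file names are made up, and process_line() stands in for the real socket work):

        #!/usr/bin/perl
        # Rough sketch: fork up to 8 children, each processing one slice
        # of the file into its own part file; the parent then concatenates
        # the parts in order, so the field1 ordering is preserved.
        use strict;
        use warnings;

        my $file = shift or die "usage: $0 file\n";
        open my $fh, '<', $file or die "$file: $!";
        my @lines = <$fh>;
        close $fh;

        my $n   = @lines > 8 ? 8 : 1;
        my $per = int(@lines / $n) + 1;

        my @pids;
        for my $i (0 .. $n - 1) {
            my $start = $i * $per;
            last if $start > $#lines;
            my $end = $start + $per - 1;
            $end = $#lines if $end > $#lines;

            my $pid = fork;
            die "fork failed: $!" unless defined $pid;
            if ($pid == 0) {    # child: process one slice, then exit
                open my $out, '>', "part.$i" or die "part.$i: $!";
                print $out process_line($_) for @lines[$start .. $end];
                close $out;
                exit 0;
            }
            push @pids, $pid;   # parent: remember the child
        }

        waitpid $_, 0 for @pids;    # wait for every child to finish

        open my $csv, '>', 'result.csv' or die "result.csv: $!";
        for my $i (0 .. $#pids) {
            open my $in, '<', "part.$i" or die "part.$i: $!";
            print $csv <$in>;
            close $in;
            unlink "part.$i";
        }
        close $csv;

        sub process_line {
            my $line = shift;
            chomp $line;
            return "$line,done\n";    # placeholder for field1,field2
        }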

      Cheers,

      Mark
        But the main thing is that the actual overhead of my process (opening a socket to a local HTTP server) is very low, yet running a 50k-line file in one serial process takes twice as long as splitting the file in two and running two processes.

        Can you characterize the processing that you're doing on lines from this file? If it's compute-intensive processing (with no program-induced blocking for I/O), then you're not likely to see much improvement from processing sections of the file in parallel.

        Where multiple processes or threads win is where the processing involves activities that block on I/O.

        To split the file, you can try the UNIX split command.
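
        For example, to break a 50,000-line file into eight roughly equal pieces (the -l count here is just 50000/8; adjust it to your file):

          split -l 6250 input.txt part.

        That yields part.aa through part.ah, which you can then feed to eight copies of your script.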
Re: Forking processes / children / threads
by rcaputo (Chaplain) on Nov 22, 2002 at 05:52 UTC

    Event-driven systems are also handy for parallel processing, provided you have cooperative modules to do the work. This program reads a list of URLs on STDIN and dumps their responses to STDOUT.

    The program uses a cooperative HTTP user agent to run parallel web requests. It can take advantage of POE::Component::Client::DNS (a cooperative host resolver) if it is also installed.

    Sample use:

      perl -pe 's!^(\S+).*!http://$1.com/!' /usr/share/dict/words \
      | perl perlmonks-url-fetcher.perl

    -- Rocco Caputo - troc@pobox.com - poe.perl.org

    #!/usr/bin/perl
    # Fetch all manner of URLs from STDIN; dumping the text of their
    # responses on STDOUT.

    use warnings;
    use strict;

    sub MAX_PARALLEL () { 8 }  # Number of requests to run at once.

    use POE;                           # Cooperative multitasking framework.
    use POE::Component::Client::HTTP;  # Non-blocking HTTP requests module.
    use HTTP::Request::Common qw(GET);

    ### Spawn the HTTP client component.  It will be named "ua", which is
    ### short for "useragent".

    POE::Component::Client::HTTP->spawn(Alias => 'ua');

    ### Start the session that will use the HTTP client.  The _start event
    ### is fired by POE to kick-start a session.

    POE::Session->create(
      inline_states => {
        _start       => \&initialize_session,
        got_response => \&handle_response,
      }
    );

    ### Run the session that will visit pages.  The run() function will
    ### not return until the session is through processing its last URL.

    $poe_kernel->run();
    exit 0;

    ### Handle the _start event by setting up the session and starting an
    ### initial number of requests.  As each request finishes, another
    ### will be started in its place.
    ###
    ### The $_[KERNEL] parameter convention is strange but useful.  See:
    ### http://poe.perl.org/?POE_FAQ/Why_does_POE_pass_parameters_as_array_slices

    sub initialize_session {
      my $kernel = $_[KERNEL];
      for (1..MAX_PARALLEL) {
        my $next_url = <STDIN>;
        last unless defined $next_url;
        chomp $next_url;
        $kernel->post(
          "ua",            # Post the request to the user agent.
          "request",       # It is a request we're posting.
          "got_response",  # The ua response should be "got_response".
          GET $next_url    # The HTTP::Request to process.
        );
      }
    }

    ### Receive a response and just dump it as_string() for demonstration
    ### purposes.  Once dumped, it attempts to read and request yet
    ### another URL.  The parameter convention is strange but useful
    ### again; this time pulling off only the values we need using a slice
    ### of @_.

    sub handle_response {
      my ($kernel, $heap, $req_packet, $resp_packet) =
        @_[KERNEL, HEAP, ARG0, ARG1];

      my $http_request  = $req_packet->[0];   # Original HTTP::Request
      my $http_response = $resp_packet->[0];  # Resulting HTTP::Response

      my $response_string = $http_response->as_string();
      $response_string =~ s/^/| /mg;

      print ",---------- ", $http_request->uri, " ----------\n";
      print $response_string;
      print "`", '-' x 78, "\n";

      # Start another request if it's available, or let the list of
      # pending URLs run out.  The session will stop when it does run out.

      my $next_url = <STDIN>;
      if (defined $next_url) {
        chomp $next_url;
        $kernel->post(ua => request => got_response => GET $next_url);
      }
    }