richardwfrancis has asked for the wisdom of the Perl Monks concerning the following question:

Oh Monks of wisdom, I call upon you to help me on my quest.

I'm doing a project which involves parsing a large file that can be simplified as follows:

A1
B2
C1
C2
A2
B1

As I'm parsing the file I'm looking for pairs, i.e. A1:A2, B1:B2 etc. Each member of the pair can exist anywhere in the file. When I find a pair, I need to test it by running it through a subroutine. This subroutine can run completely independently; however (and this is the kicker), it needs to add its results back to a hash that I then process after the file has been parsed.

Currently the parsing of the file has to wait until the subroutine finishes before continuing, and as the sub takes a while to complete, this is slowing down my script.

What I'd like is for the user to specify the number of threads or CPUs available and, once a pair is found, to split the sub off into a separate process (ensuring that the maximum is not exceeded) so that the parsing of the file can continue.

In terms of providing code, it's a complicated sub but I can simplify the whole thing as follows:

#!/usr/bin/perl
use strict;

# this is the hash I want to keep results in
my %count = ();

# there are way more lines than this but anyway...
for my $i (1 .. 100){
    # this just represents that I've found a pair and want to run the sub
    if ($i % 50 == 0){
        # run the sub
        &inc(\%count);
    }
}

# I need access to the hash that's updated in the sub
print $count{'count'}."\n";

sub inc {
    my $c = shift;

    # make a change to the hash
    $c->{'count'}++;

    # pausing for effect!
    # this simulates that the sub takes a while to run
    sleep(10);
}

This code will simply run the sub twice, sleeping for 10 seconds each time, so it takes about 20 seconds in total. If I have multiple processors available, is there a way of making this code use those resources and
1) still output "2", but
2) run for only 10 seconds?

Cheers dudes,
Rich

Replies are listed 'Best First'.
Re: Is this a job for a fork?
by BrowserUk (Patriarch) on Jul 12, 2010 at 16:44 UTC

    Threads would be much simpler.

    This could be refined in any of a number of ways given a real problem description, but just supplying the filename as an input argument should work:

    #! perl -slw
    use strict;
    use threads ( stack_size => 4096 );
    use threads::shared;
    use Thread::Queue;

    sub process {
        my $tid = threads->tid;
        my( $hashref, $Q ) = @_;
        while( my $key = $Q->dequeue ) {
            print "$tid : Processing $key";
            sleep 10;                   ## simulates the slow subroutine
            lock $hashref;              ## lock follows the ref to the shared hash
            ++$hashref->{ $key };
        }
    }

    our $T //= 16;                      ## number of worker threads
    my %hash :shared;
    my $Q = Thread::Queue->new;
    my @t = map threads->create( \&process, \%hash, $Q ), 1 .. $T;

    while( my $key = <> ) {
        chomp $key;
        if( exists $hash{ $key } ) {    ## second of a pair seen: queue it
            $Q->enqueue( $key );
        }
        else {                          ## first sighting: just remember it
            $hash{ $key } = 0;
        }
    }
    $Q->enqueue( (undef) x $T );        ## one undef per thread ends the dequeue loops
    $_->join for @t;
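
    Run it with the data file as an argument. Because of the -s on the shebang line, the default of 16 worker threads can be overridden from the command line; the script name and switch value here are just examples:

        perl pairs.pl -T=8 thefile.dat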

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Is this a job for a fork?
by JavaFan (Canon) on Jul 12, 2010 at 15:40 UTC
    You may be able to use fork to speed up the process. Whether it actually does is hard to say - measure it. Since you need to feed information back, I would have the parent parse the file; for each matching pair, it forks a child - not by calling fork directly, but via a pipe-open. Fork children until you've reached the maximum number of children, then use a select loop over the pipes to see which child is done processing. When a child is done, read its results, process them, and go back to parsing the file for the next match. Rinse and repeat until you've gone through the entire file and all the children have finished.
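
    A rough, untested sketch of that scheme. next_pair() and test_pair() are made-up stand-ins for your parsing loop and your expensive sub, and the one-line "key value" result format is just for illustration:

    use strict;
    use warnings;
    use IO::Select;

    my $MAX_KIDS = 4;
    my $sel      = IO::Select->new;
    my %results;

    sub spawn {
        my( $k1, $k2 ) = @_;
        my $pid = open my $fh, '-|';   # pipe-open: forks; child's STDOUT feeds $fh
        die "fork failed: $!" unless defined $pid;
        if( $pid == 0 ) {              # child: do the slow work, print the result
            my $r = test_pair( $k1, $k2 );
            print "$k1 $r\n";
            exit 0;
        }
        $sel->add( $fh );              # parent keeps the read end
    }

    sub reap {                         # block until at least one child has output
        for my $fh ( $sel->can_read ) {
            my $line = <$fh>;
            if( defined $line ) {
                my( $key, $val ) = split ' ', $line;
                $results{ $key } = $val;
            }
            $sel->remove( $fh );
            close $fh;                 # closing the pipe also reaps the child
        }
    }

    while( my $pair = next_pair() ) {  # parent parses the file for matches
        reap() while $sel->count >= $MAX_KIDS;
        spawn( @$pair );
    }
    reap() while $sel->count;          # drain the remaining children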
Re: Is this a job for a fork?
by thundergnat (Deacon) on Jul 12, 2010 at 18:06 UTC

    As BrowserUk said, threads may be an easier choice. I noodled around with this before I saw his reply and came up with this adaptation of your script.

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Data::Dumper;
    use threads;
    use threads::shared;

    my %count : shared;
    my $maxthreads = 3;

    # there are way more lines than this but anyway...
    for my $i ( 1 .. 500 ) {
        redo if ( scalar threads->list() >= $maxthreads );

        # this just represents that I've found a pair and want to run the sub
        if ( $i % 50 == 0 ) {
            # run the sub
            my $thread = threads->new( \&inc, $i, 2 );
        }
    }

    while ( scalar threads->list() ) { sleep 1 }

    # I need access to the hash that's updated in the sub
    print Dumper \%count;

    sub inc {
        my ( $param1, $param2 ) = @_;

        # long subroutine using passed params

        # pausing for effect!
        # this simulates that the sub takes a while to run
        sleep( 5 + ( rand 5 ) );

        # let you know something is happening
        print "$param1\n";

        # make a change to the hash
        $count{$param1} = $param1 * $param2;
        threads->self()->detach;
    }

      This is exactly what I'm looking for, thundergnat. Awesome, thank you!!!

      Many thanks to BrowserUk for his threads solution too.

      This work is actually part of some software I'm hoping to publish in a scientific journal so I'll be sure to acknowledge you both.

      Also thanks to JavaFan for the fork pseudocode and to thomas11 for his suggestion.

      This site has never failed me. The monks have been taking their genius pills. Good monks.

      Cheers

      Rich

        Be aware: whilst thundergnat's solution works, creating a new thread for every pairing is expensive (assuming there will be thousands of pairings). The Thread::Queue solution above avoids this by re-using the threads.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Is this a job for a fork?
by thomas11 (Novice) on Jul 13, 2010 at 08:09 UTC
    Sean O'Rourke's code for the Wide Finder benchmark does something very similar; you might need to change only a few lines. The idea is to partition the input file into n partitions for n processes, forked via a pipe-open. Each one works independently on its part of the file, storing its results in a hash, then passing the hash back to the parent through the pipe using Storable. Then you just need to merge the hashes into one.
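
    For what it's worth, a bare-bones sketch of that pattern (process_partition() is a made-up stand-in for the per-slice work, and the merge here just sums counts; as the reply below notes, this only fits when the slices really are independent):

    use strict;
    use warnings;
    use Storable qw(store_fd fd_retrieve);

    my $N = 4;                             # number of worker processes
    my @kids;
    for my $part ( 0 .. $N - 1 ) {
        my $pid = open my $fh, '-|';       # fork via pipe-open
        die "fork failed: $!" unless defined $pid;
        if( $pid == 0 ) {                  # child: handle one slice of the file
            my %h = process_partition( $part, $N );
            binmode STDOUT;                # Storable output is binary
            store_fd( \%h, \*STDOUT );     # serialize the result hash to the parent
            exit 0;
        }
        binmode $fh;
        push @kids, $fh;                   # parent keeps the read ends
    }

    my %merged;
    for my $fh ( @kids ) {
        my $h = fd_retrieve( $fh );        # read one child's hash back
        $merged{$_} += $h->{$_} for keys %$h;
        close $fh;                         # also reaps the child
    }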
      The idea is to partition the input file into n partitions for n processes,

      Perhaps you can explain how this scheme will handle "I'm looking for pairs, i.e. A1:A2, B1:B2 etc. Each member of the pair can exist anywhere in the file."?

        Yeah, I guess it doesn't. Don't know what I was thinking there. Damn, this was my first post on perlmonks :-(

        The partitioning solution would only work if you already had a list of potential pairs, which you could then check in parallel chunks.