Storable problem of data sharing in multiprocess

hhheng has asked for the wisdom of the Perl Monks concerning the following question:

I tried to develop a script to grab urls from a website, and since it's a very big site I need to fork many processes and then use Storable to share the data among processes.

Parent process will fetch the main page for some urls, put in a hash and array. The hash is for containin the urls, while the array is also for containation of the urls, but will be used for iteration (shift one url each time) and 0 urls as the end of iteration. Then the child process will fetch pages for links, put into the hash and array.

The design is that the child process will work on the shared hash and array, but my script actually all copy the hash and array from parent process. See script below:

Please see the test result by this link: http://www.aobu.net/cgi-bin/test_gseSM.pl. And you can see each child process is doing the same thing without sharing %urls and @unique_urls between them.

my %urls;                 #hash to contain all the urls
my @unique_urls;          #Array contains all the urls for iteration
my $base = "http://www.somedomain.com";
my $mech = WWW::Mechanize->new;
$mech->get($base);
#### Start point of %urls and for start crawling
%urls = my_own_sub($mech->links);     #My own sub to process&extract l
+inks from the page, with key is the link
@unique_urls = keys %urls;
lock_store \%url, 'url_ref';
lock_store \@unique_url, 'unique_ref';
my @child_pids;
for($i=0; $i<10; $i++){
  $pid = fork();
  push @child_pids, $pid;
  die "Couldn't fork $!" unless defined $pid;
  unless($pid){
    $url_ref = lock_retrieve('url_ref');
    $unique_ref = lock_retrieve('unique_ref');
    print "Number url: ", scalar(keys %$url_ref), "num-unique_url: ", 
+scalar(@$unique_ref), "\n";
    while($cnt<100 && (my $u=shift @$unique_ref)){ #each fork maximum 
+process 100 urls;
      $mech->get($u);
      %links = my_own_sub($mech->links);
      foreach my $link(sort keys %links){
        next if existed $url_ref->{$link};
        push @$unique_ref, $link;
        $url_ref->{$link} = 1;
      }
    }
    lock_store $url_ref, 'url_ref';
    lock_store $unique_ref, 'unique_ref';
    sleep(1);
    exit(0);
  }
}
waitpid($_, 0) foreach @child_pids;
$url_ref = lock_retrieve('url_ref');
$unique_ref = lock_retrieve('unique_ref');
print $_, "\n" foreach(sort keys %$url_ref);
print "Number of links left to be crawled: ", scalar(@$unique_ref), "\
+n";
[download]

Testing the code with a small size website, and found that each forked child process will get the %urls and @unique_urls from the parent process which marked as the start point, while my aim is that each child process will write to %urls, and each process will shift urls from and then push urls into @unique_urls, and then each process will retrieve the other child process modified %urls an @unique_urls.

I don't want to use other modules like IPC::Sharable, Parallel::ForkManager, etc to achieve my aim, and just want to use fork and Storable module.

Can anybody tell me what's wrong in my script?

Comment on Storable problem of data sharing in multiprocess Download Code

Replies are listed 'Best First'.
Re: Storable problem of data sharing in multiprocess by jellisii2 (Hermit) on Oct 03, 2014 at 11:29 UTC
This is kind of unrelated, but unless the targets of this script know you're doing this work, please process and respect robots.txt.	[reply]
Re^2: Storable problem of data sharing in multiprocess by hhheng (Initiate) on Oct 03, 2014 at 12:08 UTC
Will do check the robots.txt in another module, but first we need to make this script work first.	[reply]
Re: Storable problem of data sharing in multiprocess by Anonymous Monk on Oct 03, 2014 at 08:58 UTC
Can anybody tell me what's wrong in my script? Maybe, does it work? What does it do correctly? What does it do incorrectly? You should be able to tell by running your program against your test website if there is a problem that needs solving :) yes I am a comedian	[reply]
Re^2: Storable problem of data sharing in multiprocess by hhheng (Initiate) on Oct 03, 2014 at 09:15 UTC
Please see the test result by this link: http://www.aobu.net/cgi-bin/test_gseSM.pl. And you can see each child process is doing the same thing without sharing %urls and @unique_urls between them.	[reply]
Re^3: Storable problem of data sharing in multiprocess (only parent partitions jobs) by Anonymous Monk on Oct 03, 2014 at 09:34 UTC
Here is what I would do , partition the url list upfront into different storable files, so when you fork you're only sharing a single filename, then later the parent process unifies the results of the child process ... only parent partitions jobs because only parent spawns children #!/usr/bin/perl -- ## ## ## ## perltidy -olq -csc -csci=3 -cscl="sub : BEGIN END " -otr -opr -ce +-nibc -i=4 -pt=0 "-nsak=" ## perltidy -olq -csc -csci=10 -cscl="sub : BEGIN END if " -otr -opr +-ce -nibc -i=4 -pt=0 "-nsak=" ## perltidy -olq -csc -csci=10 -cscl="sub : BEGIN END if while " -otr + -opr -ce -nibc -i=4 -pt=0 "-nsak=*" #!/usr/bin/perl -- use strict; use warnings; use Data::Dump qw/ dd /; Main( @ARGV ); exit( 0 ); sub Main { my @files = StorePartitionUrls( GetInitialUniqueUrls() ); ForkThisStuff( @files ); UnifyChildResults( 'Ohmy-unique-hostname-urls-storable', @files ); } ## end sub Main sub GetInitialUniqueUrls { my @urls; ... return \@urls; } ## end sub GetInitialUniqueUrls sub ForkThisStuff { ## spawn kids with one file, wait, whatever for my $file ( @files ) { EachChildGetsItsOwn( $file ); } } ## end sub ForkThisStuff sub ForkThisStuff { for my $file( @files ){ ## something forking here EachChildGetsItsOwn( $file ); } } sub StorePartitionUrls { my( $urls , $partition , $fnamet, ) = @_; $partition \|\|= 100; $fnamet \|\|= 'Ohmy-candidate-urls-%d-%d-storable'; my @files; while( @$urls ){ my @hundred = splice @$urls, 0, $partition ; #~ my $file = "Ohmy-".int( @$urls ).'-'.int( @hundred ).'-s +torable'; my $file = sprintf $fnamet, int( @$urls ), int( @hundred ); lock_store \@hundred, $file; push @files, $file; } return @files; } ## end sub StorePartitionUrls __END__ [download]	[reply] [d/l]
Re^4: Storable problem of data sharing in multiprocess (only parent partitions jobs) by hhheng (Initiate) on Oct 03, 2014 at 11:59 UTC
Re^5: Storable problem of data sharing in multiprocess (only parent partitions jobs) by Anonymous Monk on Oct 04, 2014 at 02:03 UTC