hhheng has asked for the wisdom of the Perl Monks concerning the following question:
I tried to develop a script to grab URLs from a website. Since it's a very big site, I need to fork many processes and then use Storable to share the data among those processes.
The parent process fetches the main page to get some starting URLs and puts them in a hash and an array. The hash holds every URL seen so far, while the array holds the same URLs but is used for iteration (shift one URL each time), with an empty array marking the end of the iteration. The child processes then fetch the pages behind those links and add the new URLs to the hash and the array.
The design is that the child processes work on a shared hash and array, but in my script each child actually just gets its own copy of the hash and array from the parent process. See the script below:
You can see the test output at this link: http://www.aobu.net/cgi-bin/test_gseSM.pl. Each child process is doing the same work, without sharing %urls and @unique_urls between them.
```perl
use strict;
use warnings;
use WWW::Mechanize;
use Storable qw(lock_store lock_retrieve);

my %urls;        # hash to contain all the urls
my @unique_urls; # array contains all the urls, used for iteration
my $base = "http://www.somedomain.com";
my $mech = WWW::Mechanize->new;
$mech->get($base);

#### Start point of %urls and @unique_urls for crawling
%urls = my_own_sub($mech->links); # my own sub to process & extract links from the page; hash keys are the links
@unique_urls = keys %urls;
lock_store \%urls, 'url_ref';
lock_store \@unique_urls, 'unique_ref';

my @child_pids;
for (my $i = 0; $i < 10; $i++) {
    my $pid = fork();
    die "Couldn't fork: $!" unless defined $pid;
    push @child_pids, $pid;
    unless ($pid) {    # child process
        my $url_ref    = lock_retrieve('url_ref');
        my $unique_ref = lock_retrieve('unique_ref');
        print "Number url: ", scalar(keys %$url_ref),
              " num-unique_url: ", scalar(@$unique_ref), "\n";
        my $cnt = 0;
        while ($cnt++ < 100 && (my $u = shift @$unique_ref)) { # each fork processes at most 100 urls
            $mech->get($u);
            my %links = my_own_sub($mech->links);
            foreach my $link (sort keys %links) {
                next if exists $url_ref->{$link};
                push @$unique_ref, $link;
                $url_ref->{$link} = 1;
            }
        }
        lock_store $url_ref,    'url_ref';
        lock_store $unique_ref, 'unique_ref';
        sleep(1);
        exit(0);
    }
}
waitpid($_, 0) foreach @child_pids;

my $url_ref    = lock_retrieve('url_ref');
my $unique_ref = lock_retrieve('unique_ref');
print $_, "\n" foreach (sort keys %$url_ref);
print "Number of links left to be crawled: ", scalar(@$unique_ref), "\n";
```
Testing the code against a small website, I found that each forked child process gets the %urls and @unique_urls that the parent process stored as the starting point. My aim is that every child writes to the same %urls, shifts URLs from and pushes new URLs onto the same @unique_urls, and so each child sees the modifications made by the other children.
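To illustrate what I seem to be seeing, here is a stripped-down sketch (the file name 'demo_ref' and the two-child loop are just for illustration): every child that calls lock_retrieve() gets its own independent copy of whatever was in the file when it read it, and every lock_store() simply overwrites the file.

```perl
use strict;
use warnings;
use Storable qw(lock_store lock_retrieve);

my %urls = ('http://www.somedomain.com/' => 1);
lock_store \%urls, 'demo_ref';          # parent writes the start point

my @pids;
for my $n (1 .. 2) {
    my $pid = fork();
    die "Couldn't fork: $!" unless defined $pid;
    if ($pid) { push @pids, $pid; next; }

    my $copy = lock_retrieve('demo_ref');             # child reads its own snapshot
    $copy->{"http://www.somedomain.com/child$n"} = 1; # modify the private copy
    lock_store $copy, 'demo_ref';                     # overwrites the whole file
    exit 0;
}
waitpid $_, 0 for @pids;

# Depending on timing, one child's additions can simply be overwritten
# by the other child's lock_store(), and neither child ever sees the
# other's changes while it is working on its own copy.
my $final = lock_retrieve('demo_ref');
print "$_\n" for sort keys %$final;
```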
I don't want to use other modules like IPC::Shareable, Parallel::ForkManager, etc. to achieve this; I just want to use fork and the Storable module.
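For what it's worth, my assumption is that each child would need to hold an explicit exclusive lock around every read-modify-write of the two Storable files, roughly like the sketch below. The lock-file name 'queue.lock' and the helper with_queue_lock are placeholders I made up, not something I have working; it only uses fork, flock and Storable.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);
use Storable qw(lock_store lock_retrieve);

# Serialize access to the Storable files: only the process holding the
# exclusive flock may retrieve, modify and store them.
sub with_queue_lock {
    my ($code) = @_;
    open my $lock_fh, '>>', 'queue.lock' or die "Can't open lock file: $!";
    flock $lock_fh, LOCK_EX or die "Can't take lock: $!";
    my @result = $code->();   # read-modify-write happens while locked
    close $lock_fh;           # closing the handle releases the lock
    return @result;
}

# Inside each child, one iteration of the crawl loop might look like:
my ($next_url) = with_queue_lock(sub {
    my $unique_ref = lock_retrieve('unique_ref');
    my $u = shift @$unique_ref;            # take one URL off the shared queue
    lock_store $unique_ref, 'unique_ref';  # write the shortened queue back
    return $u;
});

# ... fetch $next_url with WWW::Mechanize, then take the lock again to
# push the newly found links into 'unique_ref' and mark them in 'url_ref'.
```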
Can anybody tell me what's wrong in my script?
Replies are listed 'Best First'.

- Re: Storable problem of data sharing in multiprocess, by jellisii2 (Hermit) on Oct 03, 2014 at 11:29 UTC
  - by hhheng (Initiate) on Oct 03, 2014 at 12:08 UTC
- Re: Storable problem of data sharing in multiprocess, by Anonymous Monk on Oct 03, 2014 at 08:58 UTC
  - by hhheng (Initiate) on Oct 03, 2014 at 09:15 UTC
  - by Anonymous Monk on Oct 03, 2014 at 09:34 UTC
  - by hhheng (Initiate) on Oct 03, 2014 at 11:59 UTC
  - by Anonymous Monk on Oct 04, 2014 at 02:03 UTC