shanu_040 has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,
I need your help designing the logic for my application. I want to search multiple sites from a single window (federated search), and I have already developed a search module for each site to be searched. I would like expedited search results: each search module should return its results as soon as it completes, instead of waiting for all searches to complete.
Can I use threads for this purpose? How would the Thread::Pool module be helpful? Kindly suggest how I should go about it.
Any help will be appreciated. Thanks, Shanu

Replies are listed 'Best First'.
Re: Parallel Search using Thread::Pool
by BrowserUk (Patriarch) on Mar 17, 2009 at 09:55 UTC

    The simplest architecture is to create a Thread::Queue, then have each of your search modules run in a separate thread and enqueue its results as it gets them. Your main thread can then read them off the other end of that shared queue and display them.
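
A minimal sketch of that architecture, assuming a perl built with thread support. The `%searchers` table, `federated_search`, and the `$on_result` callback are hypothetical names used for illustration; the `sleep` calls stand in for the real per-site search modules.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Hypothetical stand-ins for the real per-site search modules.
my %searchers = (
    site_a => sub { my ($term) = @_; sleep 2; return "site_a: hits for '$term'" },
    site_b => sub { my ($term) = @_; sleep 1; return "site_b: hits for '$term'" },
    site_c => sub { my ($term) = @_;          return "site_c: hits for '$term'" },
);

sub federated_search {
    my ($term, $on_result) = @_;
    my $results = Thread::Queue->new;

    # One thread per source; each enqueues its result the moment it has one.
    my @workers = map {
        my ($name, $code) = ($_, $searchers{$_});
        threads->create(sub {
            my $hits = eval { $code->($term) };
            # Always enqueue something, so the reader never waits forever.
            $results->enqueue(defined $hits ? $hits : "$name: search failed: $@");
        });
    } sort keys %searchers;

    # Exactly one result per worker, handled in completion order.
    $on_result->(scalar $results->dequeue) for @workers;

    $_->join for @workers;
    return;
}

federated_search('perl', sub { print "$_[0]\n" });
```

The main thread handles each result as soon as it is dequeued, rather than collecting everything into an array and returning it at the end.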

    Thread::Pool will not be useful to you as it is meant to run many copies of the same routine concurrently, but your application calls for running different subroutines in each of your threads.

    Depending on whether your application is web, GUI or console based, you might also want to use a second queue or a shared scalar to pass new search terms to your threads.
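
One way the second-queue idea might look, again as a sketch rather than a definitive design. It assumes long-lived worker threads, one private queue of search terms per worker, and `undef` as a shutdown signal; `@sources`, `%terms`, and `search_all` are hypothetical names.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my @sources = qw(site_a site_b);    # hypothetical source names
my $results = Thread::Queue->new;   # shared: workers write, main thread reads

# One private queue of search terms per long-lived worker.
my %terms = map { $_ => Thread::Queue->new } @sources;

my @workers = map {
    my $name = $_;
    threads->create(sub {
        # Block until a term arrives; undef means "shut down".
        while (defined(my $term = $terms{$name}->dequeue)) {
            # A real worker would call its search module here.
            $results->enqueue("$name: hits for '$term'");
        }
    });
} @sources;

sub search_all {
    my ($term) = @_;
    $_->enqueue($term) for values %terms;              # hand the term to every worker
    return map { scalar $results->dequeue } @sources;  # one result per source
}

my @first = search_all('perl');
print "$_\n" for @first;
print "$_\n" for search_all('threads');

$_->enqueue(undef) for values %terms;   # ask each worker to exit
$_->join for @workers;
```

Keeping the threads alive between searches avoids paying the thread start-up cost for every new search term.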


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thanks, my application is web based. I have created a separate Perl module (.pm) for each search source (site). Currently I am using Parallel::ForkManager. The problem I am facing is that it waits for all the children to finish their tasks, and only then can I display the results. I am stuck on how to display results as soon as a child retrieves them, and then add the other children's results to the display as they arrive. What would be the algorithm for this problem?
      thanks
      Shanu
        Hi monks,
        I am still waiting to get some kind of solution from your side.
      Could you please help me develop this? I have tried Thread::Queue. It takes too much time to retrieve the result from a source, and I have to search more than 50 sources at a time. There could be more than 10 instances running concurrently.
      How can the main thread read the results off the queue and display them?
      The following is the code I am using:
      sub run_search {
          my ($self, $searches, $search_string, $site, $max_hits, $from_year, $to_year) = @_;
          my $Qwork    = new Thread::Queue;
          my $Qresults = new Thread::Queue;
          my $THREADS  = scalar(keys %$searches);
          my @return;
          foreach my $obj (values %$searches) {
              eval {
                  $obj->from_year($from_year);
                  $obj->to_year($to_year);
                  $obj->parse_search();
              };
              if ($@) {
                  print STDERR "problem2 $@\n";
              }
              $Qwork->enqueue($obj);
          }
          $Qwork->enqueue( (undef) x $THREADS );
          my @pool = map {
              threads->create( \&parallel_search, $Qwork, $Qresults, $max_hits, $self->nuc_code)
          } 1 .. $THREADS;
          for (1..$THREADS) {
              while ( my $result = $Qresults->dequeue ) {
                  push(@return, $result);
              }
          }
          ## Clean up the threads
          $_->join for @pool;
          return(\@return);
      }

      #
      # the parallel server
      #
      sub parallel_search {
          my ($Qwork, $Qresults, $max_hits, $nuc_code) = @_;
          my $tid = threads->tid;
          my %result;
          while (my $work = $Qwork->dequeue) {
              'require ' . ref($work) . ';';
              $work->max_hits($max_hits);
              $result{$work->resource_id} = $work->get_search_results($work->resource_id, $nuc_code);
              $Qresults->enqueue( \%result );
          }
          $Qresults->enqueue( undef );
      }
        It takes too much time to retrieve the result from a source

        How do you know it is taking too long? How long is too long? How are you measuring it?

        I'll try to help, but you are going to have to explain what you are doing a lot more clearly than you have to date. Are you trying to display the results on a web page as you get them?

        If so, that could be the source of your problems. Whilst not impossible, it is quite difficult to render web pages on-the-fly because HTML simply wasn't designed to work that way.
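
For illustration only, here is a sketch of what incremental output might look like in a CGI-style script. It assumes the web server and the browser pass fragments through without buffering them, which is not guaranteed. `stream_page`, `escape_html`, and the `$next_result` callback are hypothetical names; in a real application the callback would dequeue from the results queue.

```perl
use strict;
use warnings;
use IO::Handle;

# Escape the characters that are special in HTML text.
sub escape_html {
    my ($text) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Print the page in pieces: header first, then one list item per result.
# $next_result returns the next result string, or undef when every
# source has reported.
sub stream_page {
    my ($out, $next_result) = @_;
    $out->autoflush(1);    # flush each fragment as soon as it is printed
    print {$out} "Content-Type: text/html\r\n\r\n";
    print {$out} "<html><body><h1>Results</h1><ul>\n";
    while (defined(my $result = $next_result->())) {
        print {$out} '<li>', escape_html($result), "</li>\n";
    }
    print {$out} "</ul></body></html>\n";
    return;
}

my @fake = ('site_c: 3 hits', 'site_b: <none>');   # stand-in data
stream_page(\*STDOUT, sub { shift @fake });
```

Results can only be appended in arrival order this way; re-sorting or merging them after they have been sent would need client-side scripting.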

        If not, then you are going to have to describe or post the overall operation of the application, rather than just keep posting the same basic snippet.

        • What type of application is it?

          GUI, CLI, or web app?

        • What are you searching?

          DBs, web pages, or something else?

        • You mention 50 searches and the possibility of 10 concurrent instances.

          Does each instance search all 50 sources? Are they all searching the same sources?

        I have looked at your earlier posts, but as I do not understand what you are trying to achieve, it's hard to begin to help you. I don't need the specific details of the data, but a clear overview of the dataflows is essential. Also, how long is it taking currently, and what is your target?

