
It takes too much time to retrieve the results from a source

How do you know it is taking too long? How long is too long? How are you measuring it?

I'll try to help, but you are going to have to explain what you are doing a lot more clearly than you have to date. Are you trying to display the results on a web page as you get them?

If so, that could be the source of your problems. Whilst not impossible, it is quite difficult to render web pages on-the-fly because HTML simply wasn't designed to work that way.

If not, then you are going to have to describe or post the overall operation of the application, rather than just keep posting the same basic snippet.

I have looked at your earlier posts, but as I do not understand what you are trying to achieve, it's hard to begin to help you. I don't need the specific details of the data, but a clear overview of the dataflows is essential. Also, how long is it taking currently, and what is your target?


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."
Re^4: Parallel Search using Thread::Pool
by shanu_040 (Sexton) on Jun 02, 2009 at 04:56 UTC
    Hi,
    I am working on a MetaSearch tool.
    A metasearch tool is a software application that:
    • uses multiple protocols
    • to perform simultaneous searches
    • across multiple heterogeneous electronic information resources
    • from a single point of entry.

    How do metasearch tools work?

    Metasearch software makes use of the search functionality built into each target resource it is searching. In general terms, a metasearch application goes through a series of steps to search multiple resources simultaneously and return results to the user.
    Metasearch software:

    (1) converts the user’s search into a query that can be understood by the built-in search of each of the target resources chosen to be searched. I call these converters Connectors.

    (2) broadcasts the translated query to the selected target resources.

    (3) simultaneously retrieves sets of results from all reachable target resources.

    (4) formats results into a canonical internal format to allow for further manipulation by the metasearch software.

    (5a) displays the results from each resource in its own ranked or sorted list.

    OR

    (5b) displays the results in one merged list, ranked or sorted in some fashion.

    • What type of application is it?
      It is a Web Application.
    • What are you searching?
      Multiple heterogeneous electronic information resources i.e. DOAJ, Publisher's Databases. Yes, I can say it searches Web pages
    • You mention 50 searches and the possibility of 10 concurrent instances.
      Yes, each instance may search for 50 resources.

    I broadcast the well-formatted search query to the different sources and fetch the results from each target source's connector using WWW::Mechanize. To broadcast the search I am using SOAP::Lite and Parallel::ForkManager.
    For each target source we have written a piece of code (a Connector),
    which does the following:
    • creates the WWW::Mechanize object
    • creates the search URL and gets the search results (the HTML content, using WWW::Mechanize->content)
    • filters out the HTML and other unwanted information, and creates a Record object for each record
    • returns a reference to the recordSet object

    Now, I need help with the following:
    1. Should I use processes or threads?
    2. How do I display the results as they become available from any source? The application must not wait for all of them.
    3. How do I merge all the results when I am asking for incremental display?
    4. First I want to prepare a flow diagram. Can I get help with that?
    Looking forward to your response. Thanks

      2. How do I display the results as they become available from any source? The application must not wait for all of them.

      3. How do I merge all the results when I am asking for incremental display?

      You need to employ the services of a seriously experienced web architect. Serving HTML incrementally requires detailed knowledge of both the webserver and the browsers you are seeking to target. I have neither.

      Fetching the data from 50 sources concurrently and merging the required results back together is relatively trivial: one thread per source and a common queue. Results are posted to the queue by the threads, and the CGI thread reads them off, formats them and serves them.
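
      As a rough illustration of the thread-per-source, common-queue arrangement, here is a minimal sketch using the core threads and Thread::Queue modules. It assumes a perl built with ithreads. fetch_from_source() and search_all() are hypothetical names, and the fetch is a stand-in that fabricates a record rather than calling a real connector.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Hypothetical stand-in for a real connector: it only fabricates a
# record so that the sketch runs without any network access.
sub fetch_from_source {
    my ($source, $query) = @_;
    return { source => $source, title => "result for '$query' from $source" };
}

# Hypothetical driver: one worker thread per source, one common queue.
sub search_all {
    my ($query, @sources) = @_;
    my $results = Thread::Queue->new;

    # Each worker posts its record to the queue, then posts undef
    # as an "I have finished" marker.
    my @workers = map {
        my $source = $_;
        threads->create(sub {
            $results->enqueue(fetch_from_source($source, $query));
            $results->enqueue(undef);
        });
    } @sources;

    # The consuming thread reads records off the queue in whatever
    # order they arrive, until every worker has posted its marker.
    my @records;
    my $pending = @sources;
    while ($pending) {
        my $item = $results->dequeue;
        if (defined $item) { push @records, $item }
        else               { --$pending }
    }

    $_->join for @workers;
    return @records;
}
```

      Calling search_all('perl', qw(A B C)) returns three records in arrival order, which is where the formatting and serving code would take over.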

      The difficult part is the web handling. HTTP is a request-response protocol. The browser sends a request; the server sends a response; the browser displays it. And the server won't send anything else until the browser sends another request. So to display results incrementally, you have to arrange for the browser to re-request to get updates.

      That can be done with meta tags, JavaScript or by having the user hit refresh, but then the new response will overwrite the browser's display, obliterating what was sent the first time. So for the user to see the results build up incrementally, you have to re-send any results you sent the last time plus any additions. But that means that the server has to remember what it sent, and to whom. And as HTTP is connectionless, that means having a means to identify each user and persistent storage to record what to send to whom. How you go about doing that will depend upon what web server you use; what session mechanism you use; what persistent storage you have; what web-app software/framework/development tools you use; etc. etc.
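
      To make the "re-send everything plus a refresh" idea concrete, here is a minimal sketch of the page-building side only. render_page() and escape_html() are hypothetical helpers, the results are assumed to be plain strings, and the markup is deliberately bare. The meta refresh tag is emitted only while sources are still outstanding.

```perl
use strict;
use warnings;

# Minimal HTML escaping for text inserted into the page.
sub escape_html {
    my ($text) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Build the complete page: every result gathered so far, plus a meta
# refresh tag for as long as the search is not finished.
sub render_page {
    my ($results, $done, $refresh_secs) = @_;

    my $refresh = $done
        ? ''
        : qq{<meta http-equiv="refresh" content="$refresh_secs">\n};
    my $items  = join '', map { '<li>' . escape_html($_) . "</li>\n" } @$results;
    my $status = $done ? 'Search complete.' : 'Still searching.';

    return "<html><head>\n$refresh<title>Results</title></head>\n"
         . "<body><p>$status</p>\n<ul>\n$items</ul></body></html>\n";
}
```

      Each response is the whole list again, so the browser's overwrite-on-reload behaviour does no harm; what remains is the question of where the accumulated list lives between requests.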

      There's also the problem of how your webserver is going to handle running 500 concurrent Perl threads (10 concurrent instances of 50 searches each). From my very limited understanding of Apache, it doesn't like (Perl) threads much. Less so if you are also using mod_perl or FastCGI.

      If I were trying to do this, I would have the webserver hand-off the query to a dedicated Perl process. Something like this:

      1. Webserver receives a request for the query form and serves it.
      2. When it receives the completed form, it validates the query and, if it is good, spawns a separate instance of Perl, passing it the query parameters and retrieving the port number that the new Perl instance will listen on.
      3. It then sends the browser a redirect to that port number.

        The browser is now talking directly to a Perl instance dealing with its particular query.

        That Perl instance starts the 50 threads and issues the requests.

      4. When the Perl process receives the redirected request on the port it opened, it formats any results it has received so far and sends the HTML with a meta refresh tag.
      5. Each time the refresh request is received, it adds any new results to those accumulated last time and re-sends the response.
      6. Repeat till done.
      7. Redirect the user back to the web server.
      8. Terminate the Perl process.
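
      The hand-off in steps 2 and 3 could be sketched roughly as below, using only the core IO::Socket::INET module. This covers just the "open a port, report it, build the redirect target" part. open_query_listener() and redirect_url() are hypothetical names, binding to 127.0.0.1 is an assumption made for the sketch, and how the port number gets back to the webserver is left open.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Open a listening socket on a port chosen by the operating system
# (LocalPort => 0) and return the socket together with that port number.
sub open_query_listener {
    my $listener = IO::Socket::INET->new(
        LocalAddr => '127.0.0.1',   # assumption: loopback only, for the sketch
        LocalPort => 0,             # let the OS pick a free port
        Proto     => 'tcp',
        Listen    => 5,
        ReuseAddr => 1,
    ) or die "Cannot open listening socket: $@";

    return ($listener, $listener->sockport);
}

# Build the URL that the webserver would redirect the browser to.
sub redirect_url {
    my ($host, $port) = @_;
    return "http://$host:$port/";
}
```

      The dedicated Perl instance would then loop on $listener->accept, answering each refresh request until the search is done.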

      This way, as each query is being serviced by a dedicated Perl instance, there is no possibility of mixing up the users and no need for persistent storage that would need to be cleaned up. When the user quits or the session times out; the process terminates and everything is cleaned up automatically.
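
      Because everything lives in one process, the accumulated results can simply sit in an ordinary array between refreshes. Here is a minimal sketch of the "add any new results" step, assuming the worker threads post records to a Thread::Queue and post undef when a source has finished. drain_new_results() is a hypothetical name.

```perl
use strict;
use warnings;
use Thread::Queue;

# Move whatever is currently waiting in the queue onto the accumulated
# list without blocking. Returns how many sources reported completion
# (signalled by an undef marker) during this drain.
sub drain_new_results {
    my ($queue, $accumulated) = @_;
    my $finished = 0;

    # pending() is checked first because dequeue_nb() returns undef both
    # for an empty queue and for an undef marker.
    while ($queue->pending) {
        my $item = $queue->dequeue_nb;
        if (defined $item) { push @$accumulated, $item }
        else               { ++$finished }
    }
    return $finished;
}
```

      On each refresh the process would call this once, subtract the return value from the count of outstanding sources, and re-send the page built from the accumulated array.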

      But I'm not a web guy, so take that with a huge pinch of salt and pay someone for their advice and knowledge. Choose them carefully.


        Hi,
        Instead of starting a new instance of perl on another port, can I use SOAP::Lite?
        Actually, I am using SOAP::Lite to create a parallel server, so that any number of threads or processes (I am not sure which I am going to use) can be created on different web servers at different physical locations.
        Now, if you look at it, there are two perl instances running:
        1. The main perl instance (main.pl).
        2. One which runs on a separate server using SOAP::Lite (soap_para_search.pl).
        In main.pl, I am creating the session, the cookies, a search object for each source to be searched, etc. I can then pass the search objects and the query to soap_para_search.pl using SOAP::Lite.
        Now my questions are (for threads):
        1. Can I use the same server (main.pl) as the parallel server? Yes, I am using mod_perl.
        2. What would the sample code look like? I am also using TT2 for display.
        Thanks