in reply to How to speed up my Html parsing program? (Concurrently run Subroutines?)

Something like this might work for you:

#! perl -slw
use strict;
use threads;
use Thread::Queue;
use LWP::Simple;

## (2 to 4) * $noOfCores
## depending upon your bandwidth, server response times
## and how hard you feel you should hit them!
my $noOfThreads = 10;
my $firstURL = 'http://www.example.com/thePage.htm';

sub retrieveInfo {
    my( $content ) = @_;
    my $info = parseContent( $content );
    ## do something with the info
    return;
}

sub listParse {
    my( $url, $Qout ) = @_;

    ## Get the first page
    my $content = get $url;

    ## find the links and push them onto the queue
    ## (the placeholder regex m[...] must capture the link in $1)
    while( $content =~ m[...]g ) {
        $Qout->enqueue( $1 );
    }

    ## Push 1 undef per thread to terminate their loops
    $Qout->enqueue( (undef) x $noOfThreads );
}

sub getHTML {
    my( $Qin ) = @_;

    ## Read a link
    while( my $link = $Qin->dequeue ) {
        ## Fetch the content
        my $content = get $link;

        ## And process it
        retrieveInfo( $content );
    }
}

## Create the queue
my $Qlinks = new Thread::Queue;

## Start the threads.
my @threads = map {
    threads->create( \&getHTML, $Qlinks );
} 1 .. $noOfThreads;

## Fetch and parse the first page; queue the links
listParse( $firstURL, $Qlinks );

## Join the threads
$_->join for @threads;


Re^2: How to speed up my Html parsing program? (Concurrently run Subroutines?)
by BobFishel (Acolyte) on Jan 06, 2009 at 23:55 UTC
    Thank you for this, it's very informative. I think I understand what's going on here, but my biggest question is: where in this code are multiple threads executed at once? It seems to me that getHTML is never called.

    Thanks for the help! -Bob (monk in training)
      I think I understand what's going on here ... It seems to me that getHTML is never called.

      Hm. Maybe not so much :)

      The heart of the program is these four statements:

      ## Create the queue
      my $Qlinks = new Thread::Queue;

      ## Start the threads.
      my @threads = map {
          threads->create( \&getHTML, $Qlinks );
      } 1 .. $noOfThreads;

      ## Fetch and parse the first page; queue the links
      listParse( $firstURL, $Qlinks );

      ## Join the threads
      $_->join for @threads;

      The second of those statements creates 10 threads, each running an independent copy of getHTML(), and each is passed a copy of the queue handle. Each thread sits blocked reading the queue, waiting for a link to become available. I.e. they do nothing until the next statement runs.

      listParse() also gets a handle to the queue, and whenever it finds a link it posts it to the queue; one of the threads (it doesn't matter which, as they are all identical) will pick it up and do its thing.

      When listParse() has finished finding the links, it pushes one undef per thread and then returns.

      The fourth statement waits for all the threads to finish, at which point the only thing left to do is exit.
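
      For reference, here is a stripped-down, self-contained version of the same pattern (an illustrative sketch only, with dummy work items standing in for URLs) that can be run on its own to watch the blocking dequeue, the undef sentinels and the join in action:

      #! perl -slw
      use strict;
      use threads;
      use Thread::Queue;

      my $noOfThreads = 3;
      my $Q = Thread::Queue->new;

      ## Workers: each blocks in dequeue() until something is available,
      ## and exits its loop when it pulls an undef sentinel.
      my @threads = map {
          threads->create( sub {
              while( defined( my $item = $Q->dequeue ) ) {
                  print "thread ", threads->tid, " processing $item";
              }
          } );
      } 1 .. $noOfThreads;

      ## The producer side: queue some work, then one undef per worker.
      $Q->enqueue( "item$_" ) for 1 .. 10;
      $Q->enqueue( (undef) x $noOfThreads );

      ## Wait for all workers to drain the queue and exit.
      $_->join for @threads;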


        Thanks for the response!

        OK you caught me. I definitely didn't understand it!

        I didn't realize the thread would automatically call getHTML whenever there was something in the queue.

        My next question is this:

        Currently I am passing my LWP::UserAgent by reference so that I don't recreate one every time I need it. Obviously with threading this won't work. How can I weigh the overhead of mirroring (creating?) multiple instances of objects against the speed increase of concurrent processing?

        Thanks,

        -Bob
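
        One common way to handle that (a minimal sketch, not from the post above; the timeout value and the switch from LWP::Simple to LWP::UserAgent->get are illustrative assumptions) is to construct a single LWP::UserAgent inside each worker thread and reuse it for that thread's lifetime, so the construction cost is paid once per thread rather than once per URL:

        use LWP::UserAgent;

        sub getHTML {
            my( $Qin ) = @_;

            ## One UserAgent per thread, built once and reused for every link
            ## this thread fetches (timeout is an arbitrary example value).
            my $ua = LWP::UserAgent->new( timeout => 30 );

            while( defined( my $link = $Qin->dequeue ) ) {
                my $resp = $ua->get( $link );
                next unless $resp->is_success;
                retrieveInfo( $resp->decoded_content );  ## parsing sub from the code above
            }
        }

        In practice the cost of constructing a handful of UserAgent objects is usually negligible next to the network time saved by fetching pages concurrently, so per-thread objects tend to be the right trade-off.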