in reply to Re: How to speed up my Html parsing program? (Concurrently run Subroutines?)
in thread How to speed up my Html parsing program? (Concurrently run Subroutines?)

Thank you for this, it's very informative. I think I understand what's going on here, but my biggest question is: where in this code does it make multiple threads execute at once? It seems to me that getHTML is never called.

Thanks for the help! -Bob (monk in training)

Re^3: How to speed up my Html parsing program? (Concurrently run Subroutines?)
by BrowserUk (Patriarch) on Jan 07, 2009 at 10:19 UTC
    I think I understand whats going on here ... It seems to me that gethtml is never called.

    Hm. Maybe not so much :)

    The heart of the program is these four (extended) lines:

    ## Create the queue
    my $Qlinks = new Thread::Queue;

    ## Start the threads.
    my @threads = map {
        threads->create( \&getHTML, $Qlinks );
    } 1 .. $noOfThreads;

    ## Fetch and parse the first page; queue the links
    listParse( $firstURL, $Qlinks );

    ## Join the threads
    $_->join for @threads;

    The second of those lines creates 10 threads, each running an independent copy of getHTML(), and each is passed a copy of the queue handle. Each thread sits waiting (blocking) on the queue for a link to become available. Ie. they do nothing until the next line runs.

    listParse() also gets a handle to the queue, and whenever it finds a link, it posts it to the queue, and one of the threads (it doesn't matter which, as they are all identical) will get it and do its thing.

    When listParse() has finished finding the links, it pushes one undef per thread and then returns.

    The fourth line waits for all the threads to finish, at which point the only thing left to do is exit.
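    The whole pattern can be seen in miniature with dummy work in place of the HTTP fetches. This is a self-contained sketch, not your program: the names worker(), $Qwork and $Qresults are invented here, and uppercasing a string stands in for fetching and parsing a page. A second queue carries the results back so the main thread can count them:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $noOfThreads = 3;
my $Qwork       = Thread::Queue->new;   ## work items in
my $Qresults    = Thread::Queue->new;   ## processed items out

## The worker: blocks in dequeue() until items (or undef) arrive.
sub worker {
    my( $Qin, $Qout ) = @_;
    while( my $item = $Qin->dequeue ) {
        $Qout->enqueue( uc $item );     ## "process" the item
    }
}

## Start the threads; they all block on the (still empty) queue.
my @threads = map {
    threads->create( \&worker, $Qwork, $Qresults );
} 1 .. $noOfThreads;

## Stand-in for listParse(): queue some work ...
$Qwork->enqueue( $_ ) for 'a' .. 'f';

## ... then one undef per thread to end their loops.
$Qwork->enqueue( (undef) x $noOfThreads );

## Join the threads; all work is done once this completes.
$_->join for @threads;

my @results;
while( defined( my $r = $Qresults->dequeue_nb ) ) {
    push @results, $r;
}
print scalar @results, " items processed\n";   ## prints "6 items processed"
```

    Which thread uppercased which letter varies from run to run; that's the concurrency.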


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thanks for the response!

      OK you caught me. I definitely didn't understand it!

      I didn't realize the thread would automatically call getHtml whenever there was something in the queue.

      My next question is thus:

      Currently I am passing my LWP UserAgent by reference so that I don't recreate one every time I need it. Obviously with threading this won't work. How can I weigh the overhead of creating multiple instances of objects against the speed increase of concurrent processing?

      Thanks,

      -Bob
        I didn't realize the thread would automatically call getHtml whenever there was something in the queue.

        That's not what happens either. Nothing is "automatically called". After line 2, 10 copies of getHTML() are already running.

        The statement threads->create( \&getHTML, $Qlinks );

        is effectively the same as getHTML( $Qlinks );. Ie. it is an explicit, manual call to getHTML().

        The difference is, that once getHTML() is running, control returns to the calling thread straight away, and both the new thread and the calling thread continue running at the same time. And by the end of that second line, there are 11 threads running. 10 running copies of getHTML(), and the main thread which now moves onto line 3.
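        You can watch that happen. In this little sketch (invented here; a sleep stands in for getHTML()), the main thread counts how many workers are still running immediately after the create loop, before joining any of them:

```perl
use strict;
use warnings;
use threads;

## threads->create() starts the new thread and returns to the
## caller immediately; both continue running concurrently.
my @workers = map {
    threads->create( sub { sleep 2 } );   ## stand-in for getHTML()
} 1 .. 10;

## Control is already back here while all 10 workers are still
## inside their sleep: main + 10 workers = 11 threads exist now.
my $running = threads->list( threads::running );
print "workers still running: $running\n";

$_->join for @workers;
```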

        But the ten getHTML() threads aren't doing very much when they first start, because when they try to ->dequeue() a link, there are none there (nothing has pushed anything on the queue yet), so the ->dequeue() blocks:

        sub getHTML {
            my( $Qin ) = @_;

            ## Read a link
            ## The threads block here until links are available.
            while( my $link = $Qin->dequeue ) {

                ## Fetch the content
                my $content = get $link;

                ## And process it
                retrieveInfo( $content );
            }
        }
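        That blocking is easy to demonstrate in isolation. In this sketch (names invented here), a worker thread times how long it sits inside dequeue() while the main thread deliberately leaves the queue empty for a second:

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use Time::HiRes qw( time );

my $Q = Thread::Queue->new;

## The worker dequeues from an empty queue, so it blocks inside
## dequeue() until the main thread enqueues something.
my $worker = threads->create( sub {
    my $start = time;
    my $link  = $Q->dequeue;          ## blocks here
    return time - $start;             ## how long we were blocked
} );

sleep 1;                              ## leave the queue empty for a second
$Q->enqueue( 'http://example.com' );  ## the worker wakes up here

my $blocked = $worker->join;
printf "dequeue blocked for %.1f seconds\n", $blocked;   ## typically ~1 second
```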

        It is only when the main thread reaches the third line and calls

        sub listParse {
            my( $url, $Qout ) = @_;

            ## Get the first page
            my $content = get $url;

            ## find the links and push them onto the queue
            while( $content =~ m[...]g ) {
                $Qout->enqueue( $_ );   ### Links added to queue here.
            }

            ## Push 1 undef per thread to terminate their loops
            $Qout->enqueue( (undef) x $noOfThreads );
        }

        and after it has fetched the first page and found the first link, that it pushes that link onto the queue. It is only at that point that one of the getHTML() threads will wake up (unblock) and get that first link.

        From then on, as listParse() finds and pushes more links, each of the remaining 9 getHTML() threads will be woken up (unblock) and receive (dequeue()) the next 9 links. Once all the getHTML() threads are busy--fetching the links and parsing out the info--the main thread continues pushing new links as it finds them (concurrently), until it is done.

        These new links (11..N) just accumulate in the queue until one of the 10 threads finishes the link it is processing, and loops back to get another. Each thread will run as fast as the server response time, network latency etc. allow it to, and whichever is finished processing its current link first, will be the one that grabs the next. And so on until completion.

        Meanwhile, once listParse() has found all the links, its while loop will terminate, and it will fall through to the line where it pushes one undef per thread. These act as flags or signals to the getHTML() threads telling them that there is no more work to do, and causing their while loops to terminate (in turn) when they reach that point in the queue. listParse() is now finished and returns and so falls through to the fourth line in the main thread.

        ## Join the threads
        $_->join for @threads;

        This joins each of the thread objects in turn, blocking until that thread completes. Once that line completes, all the links have been found and queued; dequeued by a thread, fetched and processed; and the threads terminated.

        All that is left to do is exit. Ie. fall off the end of the program in the normal way.

        As for the trade-off between using 10 LWP instances versus one: it is exactly that, you are trading a little memory (10 x a couple of MB) for speed. Any Perl program, be it forked, threaded or event driven, will need as many separate instances of LWP as it will use concurrently.
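        In practice that just means moving the LWP::UserAgent construction inside the thread subroutine, so each thread builds its own instance once and reuses it for every link it dequeues. A sketch, assuming LWP is installed; retrieveInfo() is stubbed out here, since the real parsing lives in your program:

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use LWP::UserAgent;

sub retrieveInfo { }   ## stub; your parsing code goes here

sub getHTML {
    my( $Qin ) = @_;

    ## One UserAgent per thread: constructed once when the thread
    ## starts, then reused for every link this thread processes.
    my $ua = LWP::UserAgent->new( timeout => 30 );

    while( my $link = $Qin->dequeue ) {
        my $resp = $ua->get( $link );
        retrieveInfo( $resp->decoded_content ) if $resp->is_success;
    }
}
```

        So you still avoid recreating the object per request; you only pay for one construction per thread, at startup.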

