Re^5: How to speed up my Html parsing program? (Concurrently run Subroutines?)

I didn't realize the thread would automatically call getHtml whenever there was something in the queue.

That's not what happens either. Nothing is "automatically called". After line 2, 10 copies of getHTML() are already running.

The statement threads->create( \&getHTML, $Qlinks );

is effectively the same as getHTML( $Qlinks ); i.e. it is an explicit, manual call to getHTML().

The difference is that once getHTML() is running, control returns to the calling thread straight away, and both the new thread and the calling thread continue running at the same time. So by the end of that second line, there are 11 threads running: 10 copies of getHTML(), and the main thread, which now moves on to line 3.
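A minimal sketch of that "control returns straight away" behaviour, with no network involved. The worker() sub, the $Qlog queue and the one-second sleep are made up for illustration; they are not part of the program under discussion.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

## Hypothetical queue, used here only to record the order of events
my $Qlog = Thread::Queue->new;

sub worker {
    my( $Q ) = @_;
    sleep 1;                         ## stand-in for slow work (fetching a page)
    $Q->enqueue( 'worker done' );
    return;
}

## create() starts worker() in a new thread and returns at once
my $thr = threads->create( \&worker, $Qlog );

## The main thread gets here while worker() is still sleeping
$Qlog->enqueue( 'main continues' );

$thr->join;

my @events = map { $Qlog->dequeue } 1 .. 2;
print "$_\n" for @events;
```

With the sleep in place, 'main continues' should be logged before 'worker done', because the main thread did not wait for worker() to return.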

But the ten getHTML() threads aren't doing very much when they first start, because when they try to ->dequeue() a link, there are none there (nothing has pushed anything on the queue yet), so the ->dequeue() blocks:

sub getHTML {
    my( $Qin ) = @_;

    ## Read a link
    ## The threads block here until links are available.
    while( my $link = $Qin->dequeue ) { 
        ## Fetch the content
        my $content = get $link;

        ## And process it
        retrieveInfo( $content );
    }
}

It is only when the main thread reaches the third line and calls listParse()

sub listParse {
    my( $url,  $Qout ) = @_;

    ## Get the first page
    my $content = get $url;

    ## find the links and push them onto the queue
    while( $content =~ m[...]g ) {
        ## $1 assumes the elided regex captures the link; the match does not set $_
        $Qout->enqueue( $1 );   ### Links added to queue here.
    }
    ## Push 1 undef per thread to terminate their loops
    $Qout->enqueue( (undef) x $noOfThreads );
}

and after it has fetched the first page and found the first link, that it pushes that link onto the queue. It is only at that point that one of the getHTML() threads will wake up (unblock) and get that first link.

From then on, as listParse() finds and pushes more links, each of the remaining 9 getHTML() threads will get woken up (unblock) and receive (dequeue()) the next 9 links. Once all the getHTML() threads are busy doing stuff--fetching the links and parsing out the info--the main thread continues pushing new links as it finds them (concurrently), until it is done.

These new links (11..N) just accumulate in the queue until one of the 10 threads finishes the link it is processing, and loops back to get another. Each thread will run as fast as the server response time, network latency etc. allow it to, and whichever is finished processing its current link first, will be the one that grabs the next. And so on until completion.

Meanwhile, once listParse() has found all the links, its while loop will terminate, and it will fall through to the line where it pushes one undef per thread. These act as flags or signals to the getHTML() threads telling them that there is no more work to do, and causing their while loops to terminate (in turn) when they reach that point in the queue. listParse() is now finished and returns and so falls through to the fourth line in the main thread.

## Join the threads
$_->join for @threads;

This calls join() on each of the thread objects in turn, and each call blocks until that thread completes. Once that line completes, all the links have been found and queued; dequeued by a thread, fetched and processed; and the threads terminated.

All that is left to do is exit, i.e. fall off the end of the program in the normal way.
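The whole sequence walked through above can be tried without any network access. The following is a cut-down sketch, not the real program: the numbers 1 .. 10 stand in for links, and the worker() sub and the $Qdone queue are invented so the outcome can be inspected. It also tests with defined(), so that only the undef markers end the loops, whereas the loop shown earlier stops on any false value.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $noOfThreads = 3;
my $Qwork = Thread::Queue->new;
my $Qdone = Thread::Queue->new;   ## hypothetical: collects results for inspection

sub worker {
    my( $Qin, $Qout ) = @_;

    ## Blocks in dequeue() until an item arrives; an undef ends the loop
    while( defined( my $item = $Qin->dequeue ) ) {
        $Qout->enqueue( "processed $item" );
    }
    return;
}

## Start the threads first; they all block because the queue is empty
my @threads = map {
    threads->create( \&worker, $Qwork, $Qdone )
} 1 .. $noOfThreads;

## Now feed them work; blocked threads wake up as items arrive
$Qwork->enqueue( $_ ) for 1 .. 10;

## One undef per thread to terminate their loops
$Qwork->enqueue( (undef) x $noOfThreads );

## Wait for every worker to finish
$_->join for @threads;

## Drain the results without blocking
my @results;
while( defined( my $r = $Qdone->dequeue_nb ) ) {
    push @results, $r;
}
printf "%d items processed\n", scalar @results;
```

Every item is handled by exactly one thread, though which thread gets which item, and the order of the results, varies from run to run.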

As for the trade-off between using 10 LWP instances versus one: it is exactly that. You are trading a little memory (10 x a couple of MB) for speed. Any Perl program, be it forked, threaded or event driven, will need as many separate instances of LWP as it will use concurrently.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^6: How to speed up my Html parsing program? (Concurrently run Subroutines?) by BobFishel (Acolyte) on Jan 08, 2009 at 04:50 UTC
Browser,

Thanks so much for your help. This is an excellent explanation, thanks for taking the time to write this up.

I've got most of my program implemented and it is running super fast (I even managed to generate a few 503 errors before I dialed down the number of threads).

Now I'm trying to wrap my head around how to do the final part of the program. The stuff I parse in retrieve_info needs to be written into a excel::writespreadsheet object. The only catch is I need to output them in the same order they appeared in the original list page. I've tried several ways of doing this that have not worked and can't find a way that will.

My last resort would be to increment a variable each time in the while loop of listParse and pass it to getHTML and then to retrieveInfo, but it seems like there should be a more elegant solution. Not a huge deal to do it my way, but I figure as long as I have someone's brain to tap I might as well make good use of it!

By the way, is there a method on Perl Monks of giving a "thanks" or a "+" or "positive" or anything of the sort? I poked around but couldn't find anything.

-Bob
Re^7: How to speed up my Html parsing program? (Concurrently run Subroutines?) by BrowserUk (Patriarch) on Jan 08, 2009 at 05:42 UTC
The stuff I parse in retrieve_info needs to be written into a excel::writespreadsheet object. The only catch is I need to output them in the same order they appeared in the original list page.

My first comment is: keep Win32::OLE far away from threads! It might be safe to call from one thread--or not. I don't go near that behemoth.

Anyway, that lends itself to my preferred solution (of several possibilities). Essentially, exactly what you've suggested. Prefix the links with an incrementing serial no as they are found. Pass that serial no through the various stages of retrieval. Write the info to STDOUT prefixed by the serial no. Redirect STDOUT through the system sort utility and a pipe to another copy of perl running a script that does the OLE stuff. That script just reads the info one line at a time from its STDIN (now sorted back into the original order) and inserts it into Excel.

The modified script would look something like this:

#! perl -slw
use strict;
use threads;
use threads::shared;
use Thread::Queue;
use LWP::Simple;

## A semaphore to serialise access to stdout
my $sem : shared;

## (2 to 4) * $noOfCores
## depending upon your bandwidth, server response times
## and how hard you feel you should hit them!
my $noOfThreads = 10;

my $firstURL = 'http://www.example.com/thePage.htm';

sub retrieveInfo {
    my( $serialNo, $content ) = @_;

    my $info = parseContent( $content );
    ## do something with the info

    ## Obtain exclusive access to STDOUT
    lock $sem;

    ## Print the info prefixed with the serial no.
    ## (printf ignores -l, so the newline is explicit)
    printf "%05d:%s\n", $serialNo, $info;
    return;
}

sub listParse {
    my( $url, $Qout ) = @_;

    ## Serial no incremented each time a link is found.
    my $serialNo = 0;

    ## Get the first page
    my $content = get $url;

    ## find the links and push them onto the queue
    while( $content =~ m[...]g ) {
        ## Queue the data pre-fixed by its serial no
        ## ($1 assumes the elided regex captures the link)
        $Qout->enqueue( ++$serialNo . ':' . $1 );
    }

    ## Push 1 undef per thread to terminate their loops
    $Qout->enqueue( (undef) x $noOfThreads );
}

sub getHTML {
    my( $Qin ) = @_;

    ## Read a link
    while( my $item = $Qin->dequeue ) {
        ## Split off and remember the serial no
        my( $serialNo, $link ) = split ':', $item, 2;

        ## Fetch the content
        my $content = get $link;

        ## And process it, passing along the serial no
        retrieveInfo( $serialNo, $content );
    }
}

## Redirect STDOUT via the system sort utility
## and via another pipe to the OLE/Excel script
open STDOUT, '|sort | perl excelOLE.pl' or die $!;

## Create the queue
my $Qlinks = Thread::Queue->new;

## Start the threads.
my @threads = map {
    threads->create( \&getHTML, $Qlinks );
} 1 .. $noOfThreads;

## Fetch and parse the first page; queue the links
listParse( $firstURL, $Qlinks );

## Join the threads
$_->join for @threads;

## Ensure the pipe gets flushed
## so that sort can do its thing
close STDOUT;

The unfortunate consequence of this is that the sort won't start until the final link has been fetched and parsed.

Another approach would be to insertion sort the info as they are processed into another shared array, and have the main thread monitor that and output/OLE the info in order as it becomes available. This would allow some overlap of the OLE processing with the info retrieval and parsing.

As the links will be pulled off the queue in the original order, they will become available in nearly sorted order--barring one or more servers that are grossly slower than the rest--so the potential for the reordered output mostly keeping up with the fetching is good.

The downside of this approach is that it complicates things, and requires the use of 'condition signalling' on shared variables. I avoid this unless absolutely required, as I find the APIs confusing and unreliable.

Given your newness to threading, I'd suggest sticking with the first approach above, and only consider the latter if you really need to further improve throughput.

Belated thought: Do you really need to "output them in the original order"? Or could you get away with inserting them into the spreadsheet in the right positions, even if they are added out of sequence? If so, you could skip the sort stage of the pipeline and have the OLE script just use the prefixes to put them into the right places.
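For what it's worth, the "output in order as it becomes available" idea can be sketched without explicit condition signalling, by sending results back to the main thread over a second Thread::Queue and holding early arrivals in a hash. That is a swapped-in technique, not the shared-array version described above. Everything here is illustrative: getInfo(), $Qresults, the made-up links, and the random delay that stands in for fetching and parsing.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $noOfThreads = 4;
my $Qlinks   = Thread::Queue->new;
my $Qresults = Thread::Queue->new;   ## hypothetical second queue: workers -> main

sub getInfo {
    my( $Qin, $Qout ) = @_;
    while( defined( my $item = $Qin->dequeue ) ) {
        my( $serialNo, $link ) = split ':', $item, 2;

        ## Stand-in for fetch + parse; the random delay makes
        ## results arrive out of order
        select( undef, undef, undef, rand 0.05 );

        $Qout->enqueue( "$serialNo:info for $link" );
    }
    return;
}

my @threads = map {
    threads->create( \&getInfo, $Qlinks, $Qresults )
} 1 .. $noOfThreads;

## Made-up links, queued with their serial nos
my @links = map { "http://www.example.com/page$_.htm" } 1 .. 12;
my $serialNo = 0;
$Qlinks->enqueue( ++$serialNo . ':' . $_ ) for @links;
$Qlinks->enqueue( (undef) x $noOfThreads );

## Reorder buffer: hold early arrivals until their turn comes
my( %pending, @ordered );
my $next = 1;
while( $next <= @links ) {
    my( $n, $info ) = split ':', $Qresults->dequeue, 2;
    $pending{ $n } = $info;

    ## Emit every result that is now contiguous with what has been output
    while( exists $pending{ $next } ) {
        push @ordered, delete $pending{ $next };
        ++$next;
    }
}

$_->join for @threads;
print "$_\n" for @ordered;
```

In a real program the push onto @ordered would be replaced by whatever writes the row, so output overlaps with the fetching instead of waiting for the last link. Note that in the script above the main thread is busy inside listParse() while links are being found, so this sketch only fits directly if the link list is cheap to produce or is produced by its own thread.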
Re^8: How to speed up my Html parsing program? (Concurrently run Subroutines?) by BobFishel (Acolyte) on Jan 08, 2009 at 14:22 UTC
Browser, as always thanks for the help.

All I need to do is have them outputted in the right positions, so no need for the sort. However, since I will be distributing this via pp as an exe, I can't count on launching a separate process in the manner you described.

What I've done, which seems to be working ok, is pushing the data into a shared array (using locks to ensure that the variable doesn't get corrupted). However, since I'm not operating on the array until the end of my program, it gets quite large.

I've been thinking about a solution over my morning coffee and here is my thought: push the data gotten in retrieveInfo into a shared and locked array. Then, before the while loop in getHTML begins another iteration, check the size of the array and if $arraysize >= $Arbitrary_value, lock the array and call output(), which will write the contents of the array and then flush its contents.

What do you think?
Re^8: How to speed up my Html parsing program? (Concurrently run Subroutines?) by BobFishel (Acolyte) on Jan 08, 2009 at 14:43 UTC
Also, Spreadsheet::WriteExcel doesn't use Win32::OLE. It simply creates a binary file in the Excel format.
Re^9: How to speed up my Html parsing program? (Concurrently run Subroutines?) by BrowserUk (Patriarch) on Jan 08, 2009 at 14:58 UTC