in reply to Re^6: How to speed up my Html parsing program? (Concurrently run Subroutines?)
in thread How to speed up my Html parsing program? (Concurrently run Subroutines?)
The stuff I parse in retrieve_info needs to be written into a excel::writespreadsheet object. The only catch is I need to output them in the same order they appeared in the original list page.
My first comment is: keep Win32::OLE far away from threads! It might be safe to call from one thread--or not. I don't go near that behemoth.
Anyway, that lends itself to my preferred solution (of several possibilities). Essentially, exactly what you've suggested.
Redirect STDOUT through the system sort utility, and then via a pipe to another copy of perl running a script that does the OLE stuff.
The modified script would look something like this:
    #! perl -slw
    use strict;
    use threads;
    use threads::shared;
    use Thread::Queue;
    use LWP::Simple;

    ## A semaphore to serialise access to stdout
    my $sem : shared;

    ## (2 to 4) * $noOfCores
    ## depending upon your bandwidth, server response times
    ## and how hard you feel you should hit them!
    my $noOfThreads = 10;

    my $firstURL = 'http://www.example.com/thePage.htm';

    sub retrieveInfo {
        my( $serialNo, $content ) = @_;

        ## parseContent() is your existing parsing routine
        my $info = parseContent( $content );

        ## do something with the info

        ## Obtain exclusive access to STDOUT
        lock $sem;

        ## Print the info prefixed with the serial no.
        ## (printf ignores -l, so supply the newline explicitly)
        printf "%05d:%s\n", $serialNo, $info;
        return;
    }

    sub listParse {
        my( $url, $Qout ) = @_;

        ## Serial no incremented each time a link is found.
        my $serialNo = 0;

        ## Get the first page
        my $content = get $url;

        ## find the links and push them onto the queue
        ## (the elided pattern is expected to capture each link in $1)
        while( $content =~ m[...]g ) {
            ## Queue the data pre-fixed by its serial no
            $Qout->enqueue( ++$serialNo . ':' . $1 );
        }

        ## Push 1 undef per thread to terminate their loops
        $Qout->enqueue( (undef) x $noOfThreads );
    }

    sub getHTML {
        my( $Qin ) = @_;

        ## Read a link; the undef pushed by listParse ends the loop
        while( defined( my $item = $Qin->dequeue ) ) {
            ## Split off and remember the serial no
            my( $serialNo, $link ) = split ':', $item, 2;

            ## Fetch the content
            my $content = get $link;

            ## And process it, passing along the serial no
            retrieveInfo( $serialNo, $content );
        }
    }

    ## Redirect STDOUT via the system sort utility
    ## and via another pipe to the OLE/Excel script
    open STDOUT, '|sort | perl excelOLE.pl' or die $!;

    ## Create the queue
    my $Qlinks = Thread::Queue->new;

    ## Start the threads.
    my @threads = map {
        threads->create( \&getHTML, $Qlinks );
    } 1 .. $noOfThreads;

    ## Fetch and parse the first page; queue the links
    listParse( $firstURL, $Qlinks );

    ## Join the threads
    $_->join for @threads;

    ## Ensure the pipe gets flushed
    ## so that sort can do its thing
    close STDOUT;
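A note on the zero-padded `%05d` prefix: it is what allows a plain textual sort to restore the original order, because lines are compared character by character rather than numerically. A minimal illustration, using Perl's own string sort as a stand-in for the system utility (whose collation may differ with locale); the sample records are made up:

```perl
use strict;
use warnings;

## Made-up serial nos, arriving out of order
my @serials = ( 10, 2, 1 );

## Unpadded prefixes: compared as text, "10:" sorts ahead of "1:" and "2:"
my @unpadded = sort { $a cmp $b } map { "$_:info" } @serials;

## Padded prefixes, in the same format as printf "%05d:%s"
my @padded = sort { $a cmp $b } map { sprintf "%05d:%s", $_, 'info' } @serials;

print "unpadded: @unpadded\n";
print "padded:   @padded\n";
```

Five digits covers up to 99999 links; widen the field if you expect more.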
The unfortunate consequence of this is that sort cannot produce any output until it has seen its last input line, so the OLE script won't start receiving data until the final link has been fetched and parsed.
Another approach would be to insertion-sort the results into another shared array as they are processed, and have the main thread monitor that array and output/OLE the info in order as it becomes available. This would allow some of the OLE processing to overlap with the info retrieval and parsing.
As the links will be pulled off the queue in the original order, they will become available in nearly sorted order--barring one or more servers that are grossly slower than the rest, so the potential for the reordered output mostly keeping up with the fetching is good.
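The reordering at the heart of that idea can be sketched without any threading. This is only an illustration, not the full approach: it swaps the insertion-sorted shared array for a hash keyed by serial no, and it leaves out the sharing and signalling between threads entirely. `makeReorderer` and the sample data are hypothetical names invented for the example.

```perl
use strict;
use warnings;

## Hypothetical helper: returns a closure that accepts ( $serialNo, $info )
## in any order and calls $emit->( $serialNo, $info ) in strict serial order.
sub makeReorderer {
    my( $emit ) = @_;
    my %pending;     ## results that arrived ahead of their turn
    my $next = 1;    ## the serial no we are waiting to output

    return sub {
        my( $serialNo, $info ) = @_;
        $pending{ $serialNo } = $info;

        ## Flush every consecutive result that is now available
        while( exists $pending{ $next } ) {
            $emit->( $next, delete $pending{ $next } );
            ++$next;
        }
        return;
    };
}

## Usage: results 2 and 3 arrive before 1, so nothing is
## emitted until 1 turns up; then 1, 2 and 3 go out together
my @out;
my $accept = makeReorderer( sub { push @out, "$_[0]:$_[1]" } );
$accept->( @$_ ) for [ 2, 'b' ], [ 3, 'c' ], [ 1, 'a' ], [ 4, 'd' ];

print "$_\n" for @out;
```

Because results arrive nearly in order, `%pending` should normally stay small; it only grows while one slow fetch holds everything behind it back.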
The downside of this approach is that it complicates things, and requires the use of 'condition signalling' on shared variables. I avoid this unless absolutely required, as I find the APIs confusing and unreliable. Given your newness to threading, I'd suggest sticking with the first approach above, and only consider the latter if you really need to further improve throughput.
Belated thought: Do you really need to "output them in the original order"? Or could you get away with inserting them into the spreadsheet in the right positions, even if they are added out of sequence? If so, you could skip the sort stage of the pipeline and have the OLE script just use the prefixes to put them into the right places.
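A rough sketch of what "use the prefixes to put them into the right places" could look like on the receiving side. The spreadsheet is replaced here by a plain array, one slot per row, purely so the example runs anywhere; in the real OLE script the assignment would be a write to the worksheet. `placeRecord` and the sample records are made up for illustration.

```perl
use strict;
use warnings;

## Hypothetical stand-in for the spreadsheet: one array slot per row
my @rows;

sub placeRecord {
    my( $line ) = @_;
    chomp $line;

    ## Split off the serial no; the limit of 2 keeps colons in the info intact
    my( $serialNo, $info ) = split ':', $line, 2;

    ## "00003" numifies to 3; serial nos start at 1, array rows at 0
    $rows[ $serialNo - 1 ] = $info;
    return;
}

## Made-up records arriving out of sequence
placeRecord( $_ ) for "00003:gamma\n", "00001:alpha:one\n", "00002:beta\n";

print "$_\n" for @rows;
```

In the real script the lines would come from STDIN rather than a literal list.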