BobFishel has asked for the wisdom of the Perl Monks concerning the following question:

So I'm writing a program to parse certain information out of HTML pages.
Currently the program consists (in a simplified manner) of one function, List_Parse(), which takes the page to parse.
This page contains a list of links, which then need to be retrieved by the get_page_html() method and their content passed to retrieve_info().

Once all the links in the original page passed to List_Parse() are exhausted, I increment the pg variable in the query part of the URL and pass that next page to List_Parse(), repeating the process.
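Here is roughly what the current flow looks like (a simplified sketch; the URL, the link-extraction regex and the actual parsing below are just placeholders for what my real code does):

use strict;
use warnings;
use LWP::Simple qw(get);

my $base_url = 'http://www.example.com/list.htm?pg=';   # placeholder URL

# walk the list pages one at a time
for my $pg ( 1 .. 50 ) {
    List_Parse( get( $base_url . $pg ) );
}

sub List_Parse {
    my ($list_html) = @_;
    return unless defined $list_html;

    # pull each link out of the list page (placeholder regex)
    while ( $list_html =~ m{href="([^"]+)"}g ) {
        my $html = get_page_html($1);            # one page at a time...
        retrieve_info($html) if defined $html;   # ...so the latencies add up
    }
}

sub get_page_html { return get( $_[0] ) }

sub retrieve_info {
    my ($html) = @_;
    # extract the fields of interest and write them out (omitted)
}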

Currently I have this done using loops and recursion, and I can grab and parse about one page per second. That would seem a decent speed, but parsing 1000 pages still takes roughly 15 minutes.

I am looking for a way to process multiple pages concurrently to speed up the process (I am open to other ways of speeding it up as well, of course), but I am unsure how to do this.
I know that multithreading/forking/some form of parallel processing should be able to do this (run multiple instances of subroutines concurrently), but I haven't found a primer I can wrap my head around.

If someone could maybe shoot some pseudo code in my direction or offer any tips on how to do this, I'd appreciate it. (I'm currently using ActiveState, and my program will only be running on Windows and MAYBE on a Mac.)

Thanks

-Bob

Replies are listed 'Best First'.
Re: How to speed up my Html parsing program? (Concurrently run Subroutines?)
by BrowserUk (Patriarch) on Jan 06, 2009 at 07:46 UTC

    Something like this might work for you:

    #! perl -slw
    use strict;

    use threads;
    use Thread::Queue;
    use LWP::Simple;

    ## (2 to 4) * $noOfCores
    ## depending upon your bandwidth, server response times
    ## and how hard you feel you should hit them!
    my $noOfThreads = 10;

    my $firstURL = 'http://www.example.com/thePage.htm';

    sub retrieveInfo {
        my( $content ) = @_;
        my $info = parseContent( $content );
        ## do something with the info
        return;
    }

    sub listParse {
        my( $url, $Qout ) = @_;

        ## Get the first page
        my $content = get $url;

        ## find the links and push them onto the queue
        while( $content =~ m[...]g ) {
            $Qout->enqueue( $1 );    ## assuming the regex captures the link into $1
        }

        ## Push 1 undef per thread to terminate their loops
        $Qout->enqueue( (undef) x $noOfThreads );
    }

    sub getHTML {
        my( $Qin ) = @_;

        ## Read a link
        while( my $link = $Qin->dequeue ) {
            ## Fetch the content
            my $content = get $link;
            ## And process it
            retrieveInfo( $content );
        }
    }

    ## Create the queue
    my $Qlinks = Thread::Queue->new;

    ## Start the threads.
    my @threads = map {
        threads->create( \&getHTML, $Qlinks );
    } 1 .. $noOfThreads;

    ## Fetch and parse the first page; queue the links
    listParse( $firstURL, $Qlinks );

    ## Join the threads
    $_->join for @threads;

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thank you for this, it's very informative. I think I understand what's going on here, but my biggest question is: where in this code does it make multiple threads execute at once? It seems to me that getHTML is never called.

      Thanks for the help! -Bob (monk in training)
        I think I understand what's going on here ... It seems to me that getHTML is never called.

        Hm. Maybe not so much :)

        The heart of the program is these four (extended) lines:

        ## Create the queue
        my $Qlinks = Thread::Queue->new;

        ## Start the threads.
        my @threads = map {
            threads->create( \&getHTML, $Qlinks );
        } 1 .. $noOfThreads;

        ## Fetch and parse the first page; queue the links
        listParse( $firstURL, $Qlinks );

        ## Join the threads
        $_->join for @threads;

        The second of those lines creates 10 threads, each running an independent copy of getHTML(), and each is passed a copy of the queue handle. Each thread sits blocking on the queue, waiting for a link to become available. I.e. they do nothing until the next line runs.

        listParse() also gets a handle to the queue, and whenever it finds a link, it posts it to the queue, and one of the threads (it doesn't matter which, as they are all identical) will pick it up and do its thing.

        When listParse() has finished finding the links, it pushes one undef per thread and then returns.

        The fourth line waits for all the threads to finish, at which point the only thing left to do is exit.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: How to speed up my Html parsing program? (Concurrently run Subroutines?)
by Marshall (Canon) on Jan 06, 2009 at 10:53 UTC

    This is just a simple outline of a web gizmo that could hit a lot of pages simultaneously. I am sure that the one-second response time per page is due to the web, not the Perl parsing of the data.

    I haven't written a multi-process LWP web app, as the normal sites that I have LWP clients for would probably be upset if I whacked 'em more than a few times per second, and I try to be polite. There are also limits on how many connections you can have open at once... I've never come close in Windows, so I don't know.

    The poster asked for pseudo code, and here is one attempt, with no error handling or time-out escapes. fork() on Windows is weird: it is actually emulated with a thread rather than a separate process.

    ---- client (single program in this case):

    maintains a list of requests that it wants answers to
    (replies to those requests are required)..
    Maybe this is just a hash table with URLs?

    put first request(s) onto request queue, then talk_2_server;

    talk_2_server
    {
        while (a request hasn't been sent or some request hasn't been answered...)
        {
            if (server has reply to a previous request)
            {
                take it off the outstanding queue and deal with it..
                this action will generate additional new requests that go
                onto the queue...maybe requests for 20 sub-pages
                ..be careful .. you might overload the site you are talking to!
            }
            while (I have a new request in queue) { send it to server };
            # might want to think about a "throttle" if hitting
            # same website
        }
    }
    maybe I'm done, or I need to loop and stuff more things onto the
    request queue and talk_2_server again...

    --- server:

    I see a new request, fork a child to deal with it
    (like maybe get the info from URL X). I then wait for the next request.

    --- child:

    I've got some answer, so I want to send the result to the client.
    I cooperate with the other children so that I can send an "atomic"
    response on the pipe back to the client via some kind of locking
    mechanism. Then my job's done, I die.

    Message format could be as simple as: first line is the URL you
    requested...followed by some html response.
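    To make that a bit more concrete, here is a minimal fork/wait sketch (not production code: no error handling, no throttling beyond the child limit, and the "deal with the result" step is just a comment; remember fork on Windows is emulated with threads):

    use strict;
    use warnings;
    use LWP::Simple qw(get);

    my @urls     = @ARGV;    # placeholder: the links pulled from the list page
    my $max_kids = 10;       # how hard to hit the site
    my %kids;                # pid => url, so the parent knows who is still out

    while ( @urls or %kids ) {

        # spawn children up to the limit
        while ( @urls and keys %kids < $max_kids ) {
            my $url = shift @urls;
            my $pid = fork();
            die "fork failed: $!" unless defined $pid;
            if ( $pid == 0 ) {                # child: fetch, report, die
                my $html = get($url);
                # real code would parse $html here and write the result
                # somewhere the parent can collect it (a per-child temp
                # file keeps the writes atomic)
                exit( defined $html ? 0 : 1 );
            }
            $kids{$pid} = $url;               # parent: remember the child
        }

        # reap one finished child before spawning more
        my $done = wait();
        delete $kids{$done} if $done > 0;
    }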
      Good stuff here, thanks for the reply!!

      A couple of questions, and my apologies if these are simple/dumb, but I have zero experience with fork; to date most of my programs have been very simple.

      1. Where in your pseudo code would I be forking a process?

      2. Does the queue maintain a list of requests to be passed to a forked process once the process completes?

      3. In a forked process, do I need to be careful about passing by reference? Currently I'm passing a few variables (the LWP downloader and a Spreadsheet::WriteExcel object for output) by reference so as to reduce overhead.

      4. Also, do I need to be careful about multiple write calls on my workbook? Currently I do the output after processing each page.

      Thanks for your help so far!
Re: How to speed up my Html parsing program? (Concurrently run Subroutines?)
by sflitman (Hermit) on Jan 06, 2009 at 06:58 UTC
    You could try Parallel::Simple and run multiple queries at once, but there's no real speed-up unless you're running on multiple cores. It's probably more worthwhile to take a look at your code and see how to improve it. If you're doing a lot of regular expression matching you might want to call study on your string as that does some groundwork which might speed things up.
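    For example, a tiny sketch of what that looks like (whether study actually pays off depends on the string and the patterns, so it's worth benchmarking; the regexes here are just placeholders):

    use strict;
    use warnings;

    my $html = do { local $/; <> };   # or however you already hold the page

    study $html;   # builds a lookup table for $html; only one string can be
                   # "studied" at a time, and it only pays off if you run many
                   # different patterns against that same string

    my @titles = $html =~ m{<title>(.*?)</title>}gis;
    my @hrefs  = $html =~ m{href="([^"]+)"}gi;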
    HTH, SSF
      This page contains a list of links, which then need to be retrieved by the get_page_html() method and their content passed to retrieve_info().
      If the get_page_html() method retrieves pages over a network (rather than from disk), there is great potential for improving performance with forking or threads. In a single process/thread, network latency is additive. With multiple processes/threads, latency costs can run concurrently.
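      As a sketch of the forking route (this uses Parallel::ForkManager from CPAN, which the OP's code doesn't currently use; the link list and retrieve_info() are stand-ins):

      use strict;
      use warnings;
      use LWP::Simple qw(get);
      use Parallel::ForkManager;

      my @links = ();                              # the links found by List_Parse()
      my $pm    = Parallel::ForkManager->new(10);  # at most 10 fetches in flight

      for my $link (@links) {
          $pm->start and next;          # parent: keep looping, a child handles $link
          my $html = get($link);        # child: fetches while its siblings fetch too
          # NB: the child has its own copy of every variable, so it must write
          # its results to a file (or pipe) for the parent to collect
          retrieve_info($html) if defined $html;
          $pm->finish;                  # child exits here
      }
      $pm->wait_all_children;

      sub retrieve_info { }             # stand-in for the real parser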
        Yes, it retrieves it over a network. I'm definitely going to look into forking; after doing some reading last night, it seems this is my best bet at this point. Now I just need to figure out how to keep my variables independent. I haven't dug into the code in the responses below yet, but from the looks of it they seem like a great starting point!
      Wow, I just got some time to look into study, and I feel I will definitely make some gains incorporating it into my program. Thanks!