BobFishel has asked for the wisdom of the Perl Monks concerning the following question:

So I'm writing a program to parse certain information out of HTML pages.
Currently the program consists (in a simplified manner) of one function, List_Parse(), which takes the page to parse.
This page contains a list of links, which then need to be retrieved by the get_page_html() method and their content passed to retrieve_info().

Once all the links in the original page passed to List_Parse() are exhausted, I increment the pg variable in the query part of the URL and pass that next page to List_Parse(), repeating the process.
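Here is roughly what the current flow looks like (a simplified sketch; the URL, the link-extraction regex and the actual parsing below are just placeholders for what my real code does):

use strict;
use warnings;
use LWP::Simple qw(get);

my $base_url = 'http://www.example.com/list.htm?pg=';   # placeholder URL

# walk the list pages one at a time
for my $pg ( 1 .. 50 ) {
    List_Parse( get( $base_url . $pg ) );
}

sub List_Parse {
    my ($list_html) = @_;
    return unless defined $list_html;

    # pull each link out of the list page (placeholder regex)
    while ( $list_html =~ m{href="([^"]+)"}g ) {
        my $html = get_page_html($1);            # one page at a time...
        retrieve_info($html) if defined $html;   # ...so the latencies add up
    }
}

sub get_page_html { return get( $_[0] ) }

sub retrieve_info {
    my ($html) = @_;
    # extract the fields of interest and write them out (omitted)
}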

Currently I have this done using loops and recursion, and I can grab and parse about one page per second. That would seem a decent speed, but parsing 1000 pages still takes roughly 15 minutes.

I am looking for a way to process multiple pages concurrently to speed up the process (I am open to other ways of speeding it up as well, of course), but I am unsure how to do this.
I know that multithreading/forking/some form of parallel processing should be able to do this (run multiple instances of subroutines concurrently), but I haven't found a primer I can wrap my head around.

If someone could maybe shoot some pseudo code in my direction or offer any tips on how to do this, I'd appreciate it. (I'm currently using ActiveState, and my program will only be running on Windows and MAYBE on a Mac.)

Thanks

-Bob

Replies are listed 'Best First'.
Re: How to speed up my Html parsing program? (Concurrently run Subroutines?)
by BrowserUk (Patriarch) on Jan 06, 2009 at 07:46 UTC

    Something like this might work for you:

    #! perl -slw
    use strict;

    use threads;
    use Thread::Queue;
    use LWP::Simple;

    ## (2 to 4) * $noOfCores
    ## depending upon your bandwidth, server response times
    ## and how hard you feel you should hit them!
    my $noOfThreads = 10;

    my $firstURL = 'http://www.example.com/thePage.htm';

    sub retrieveInfo {
        my( $content ) = @_;
        my $info = parseContent( $content );
        ## do something with the info
        return;
    }

    sub listParse {
        my( $url, $Qout ) = @_;

        ## Get the first page
        my $content = get $url;

        ## find the links and push them onto the queue
        while( $content =~ m[...]g ) {
            $Qout->enqueue( $1 );    ## assuming the regex captures the link into $1
        }

        ## Push 1 undef per thread to terminate their loops
        $Qout->enqueue( (undef) x $noOfThreads );
    }

    sub getHTML {
        my( $Qin ) = @_;

        ## Read a link
        while( my $link = $Qin->dequeue ) {
            ## Fetch the content
            my $content = get $link;
            ## And process it
            retrieveInfo( $content );
        }
    }

    ## Create the queue
    my $Qlinks = Thread::Queue->new;

    ## Start the threads.
    my @threads = map {
        threads->create( \&getHTML, $Qlinks );
    } 1 .. $noOfThreads;

    ## Fetch and parse the first page; queue the links
    listParse( $firstURL, $Qlinks );

    ## Join the threads
    $_->join for @threads;

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thank you for this, it's very informative. I think I understand what's going on here, but my biggest question is: where in this code does it make multiple threads execute at once? It seems to me that getHTML is never called.

      Thanks for the help! -Bob (monk in training)
        I think I understand what's going on here ... It seems to me that getHTML is never called.

        Hm. Maybe not so much :)

        The heart of the program is these four (extended) lines:

        ## Create the queue
        my $Qlinks = Thread::Queue->new;

        ## Start the threads.
        my @threads = map {
            threads->create( \&getHTML, $Qlinks );
        } 1 .. $noOfThreads;

        ## Fetch and parse the first page; queue the links
        listParse( $firstURL, $Qlinks );

        ## Join the threads
        $_->join for @threads;

        The second of those lines creates 10 threads, each running an independent copy of getHTML(), and each is passed a copy of the queue handle. Each thread sits blocking on the queue, waiting for a link to become available. I.e. they do nothing until the next line runs.

        listParse() also gets a handle to the queue, and whenever it finds a link, it posts it to the queue, and one of the threads (it doesn't matter which, as they are all identical) will pick it up and do its thing.

        When listParse() has finished finding the links, it pushes one undef per thread and then returns.

        The fourth line waits for all the threads to finish, at which point the only thing left to do is exit.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: How to speed up my Html parsing program? (Concurrently run Subroutines?)
by Marshall (Canon) on Jan 06, 2009 at 10:53 UTC

    This is just a simple outline of a web gizmo that could hit a lot of pages simultaneously. I am sure that the one-second response time per page is due to the web, not the Perl parsing of the data.

    I haven't written a multi-process LWP web app, as the normal sites that I have LWP clients for would probably be upset if I whacked 'em more than a few times per second, and I try to be polite. There are also limits on how many connections you can have open at once... I've never come close in Windows, so I don't know.

    The poster asked for pseudo code, and here is one attempt, with no error handling or time-out escapes. fork() on Windows is weird: it is actually emulated with a thread rather than a separate process.

    ---- client (single program in this case):

    maintains a list of requests that it wants answers to
    (replies to those requests are required)..
    Maybe this is just a hash table with URLs?

    put first request(s) onto request queue, then talk_2_server;

    talk_2_server
    {
        while (a request hasn't been sent or some request hasn't been answered...)
        {
            if (server has reply to a previous request)
            {
                take it off the outstanding queue and deal with it..
                this action will generate additional new requests that go
                onto the queue...maybe requests for 20 sub-pages
                ..be careful .. you might overload the site you are talking to!
            }
            while (I have a new request in queue) { send it to server };
            # might want to think about a "throttle" if hitting
            # same website
        }
    }
    maybe I'm done, or I need to loop and stuff more things onto the
    request queue and talk_2_server again...

    --- server:

    I see a new request, fork a child to deal with it
    (like maybe get the info from URL X). I then wait for the next request.

    --- child:

    I've got some answer, so I want to send the result to the client.
    I cooperate with the other children so that I can send an "atomic"
    response on the pipe back to the client via some kind of locking
    mechanism. Then my job's done, I die.

    Message format could be as simple as: first line is the URL you
    requested...followed by some html response.
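    To make that a bit more concrete, here is a minimal fork/wait sketch (not production code: no error handling, no throttling beyond the child limit, and the "deal with the result" step is just a comment; remember fork on Windows is emulated with threads):

    use strict;
    use warnings;
    use LWP::Simple qw(get);

    my @urls     = @ARGV;    # placeholder: the links pulled from the list page
    my $max_kids = 10;       # how hard to hit the site
    my %kids;                # pid => url, so the parent knows who is still out

    while ( @urls or %kids ) {

        # spawn children up to the limit
        while ( @urls and keys %kids < $max_kids ) {
            my $url = shift @urls;
            my $pid = fork();
            die "fork failed: $!" unless defined $pid;
            if ( $pid == 0 ) {                # child: fetch, report, die
                my $html = get($url);
                # real code would parse $html here and write the result
                # somewhere the parent can collect it (a per-child temp
                # file keeps the writes atomic)
                exit( defined $html ? 0 : 1 );
            }
            $kids{$pid} = $url;               # parent: remember the child
        }

        # reap one finished child before spawning more
        my $done = wait();
        delete $kids{$done} if $done > 0;
    }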
      Good stuff here, thanks for the reply!!

      A couple of questions, and my apologies if these are simple/dumb, but I have zero experience with fork; to date most of my programs have been very simple.

      1. Where in your pseudo code would I be forking a process?

      2. Does the queue maintain a list of requests to be passed to a forked process once the process completes?

      3. In a forked process, do I need to be careful about passing by reference? Currently I'm passing a few variables (the LWP downloader and a Spreadsheet::WriteExcel object for output) by reference so as to reduce overhead.

      4. Also, do I need to be careful about multiple write calls on my workbook? Currently I do the output after processing each page.

      Thanks for your help so far!
Re: How to speed up my Html parsing program? (Concurrently run Subroutines?)
by sflitman (Hermit) on Jan 06, 2009 at 06:58 UTC
    You could try Parallel::Simple and run multiple queries at once, but there's no real speed-up unless you're running on multiple cores. It's probably more worthwhile to take a look at your code and see how to improve it. If you're doing a lot of regular expression matching you might want to call study on your string as that does some groundwork which might speed things up.
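    For example, a tiny sketch of what that looks like (whether study actually pays off depends on the string and the patterns, so it's worth benchmarking; the regexes here are just placeholders):

    use strict;
    use warnings;

    my $html = do { local $/; <> };   # or however you already hold the page

    study $html;   # builds a lookup table for $html; only one string can be
                   # "studied" at a time, and it only pays off if you run many
                   # different patterns against that same string

    my @titles = $html =~ m{<title>(.*?)</title>}gis;
    my @hrefs  = $html =~ m{href="([^"]+)"}gi;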
    HTH, SSF
      This page contains a list of links, which then need to be retrieved by the get_page_html() method and their content passed to retrieve_info().
      If the get_page_html() method retrieves pages over a network (rather than from disk), there is great potential for improving performance with forking or threads. In a single process/thread, network latency is additive. With multiple processes/threads, latency costs can run concurrently.
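      As a sketch of the forking route (this uses Parallel::ForkManager from CPAN, which the OP's code doesn't currently use; the link list and retrieve_info() are stand-ins):

      use strict;
      use warnings;
      use LWP::Simple qw(get);
      use Parallel::ForkManager;

      my @links = ();                              # the links found by List_Parse()
      my $pm    = Parallel::ForkManager->new(10);  # at most 10 fetches in flight

      for my $link (@links) {
          $pm->start and next;          # parent: keep looping, a child handles $link
          my $html = get($link);        # child: fetches while its siblings fetch too
          # NB: the child has its own copy of every variable, so it must write
          # its results to a file (or pipe) for the parent to collect
          retrieve_info($html) if defined $html;
          $pm->finish;                  # child exits here
      }
      $pm->wait_all_children;

      sub retrieve_info { }             # stand-in for the real parser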
        Yes, it retrieves it over a network. I'm definitely going to look into forking; after doing some reading last night, it seems this is my best bet at this point. Now I just need to figure out how to keep my variables independent. I haven't dug into the code in the responses below yet, but from the looks of it they seem like a great starting point!
      Wow, I just got some time to look into study, and I feel I will definitely make some gains incorporating it into my program. Thanks!