in reply to How to speed up my Html parsing program? (Concurrently run Subroutines?)

This is just a simple outline of a web gizmo that could hit a lot of pages simultaneously. I am sure that the one second response time per page is due to the web, not the Perl parsing of the data.

I haven't written a multi-process LWP web app, as the sites that I normally have LWP clients for would probably be upset if I whacked 'em more than a few times per second, and I try to be polite. There are also limits on how many connections you can have open at once...I've never come close on Windows, so I don't know what they are.
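
To make the "be polite" point concrete, here is a minimal throttle sketch. The name make_throttle, the 0.25 second gap, and the example.com URLs are all made up for illustration; the right limit depends on the site you are hitting.

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Returns a closure that blocks until at least $min_gap seconds
# have passed since the previous call (hypothetical helper).
sub make_throttle {
    my ($min_gap) = @_;
    my $last = 0;
    return sub {
        my $wait = $last + $min_gap - time();
        sleep($wait) if $wait > 0;
        $last = time();
        return;
    };
}

# Assumed limit: roughly 4 requests per second.
my $throttle = make_throttle(0.25);
for my $url (map { "http://example.com/page$_" } 1 .. 3) {
    $throttle->();
    print "would fetch $url\n";    # real code would hand the URL to LWP here
}
```

In a forking design the throttle would probably live in the process that hands out requests, since each child only ever sees its own single fetch.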

The poster asked for pseudo code, and here is one attempt with no error handling or timeout escapes. fork() on Windows is weird: Perl emulates it with a thread inside the same process instead of creating a separate process.

---- client (single program in this case):
maintains a list of requests that it wants answers to
(replies to those requests are required).
Maybe this is just a hash table keyed by URL?

put first request(s) onto request queue, then call talk_2_server;

talk_2_server
{
   while (a request hasn't been sent or some request hasn't been answered...)
   {
      if (server has reply to a previous request)
      {
          take it off the outstanding queue and deal with it..
          this action may generate additional new requests
          that go onto the queue...maybe requests for 20 sub-pages
          ..be careful .. you might overload whoever you are talking to!
      }
      while (I have a new request in queue)
         { send it to server };
         # might want to think about a "throttle" if hitting
         # the same website
   }
}
maybe I'm done, or I need to loop and stuff more things onto
the request queue and talk_2_server again...

--- server:
I see a new request, fork a child to deal with it
(like maybe get the info from URL X). I then wait for the next request.

--- child:
I've got some answer, so I want to send the result to the client.
I cooperate with the other children so that I can send an "atomic"
response on the pipe back to the client via some kind of locking mechanism.
Then my job's done, I die.
Message format could be as simple as: first line is the URL
you requested...followed by some html response.
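
A rough Perl sketch of the server/child part of that outline, assuming a Unix-like system. Everything named here is an assumption for illustration: fetch_page is a stand-in that fakes the download so the sketch runs offline (real code would call LWP), fetch_all and the "URL LENGTH" header line are one possible framing, and a flock'ed temp file is just one way to get the "atomic" write on the shared pipe. It has no timeouts and forks one child per URL with no cap.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);
use File::Temp qw(tempfile);
use IO::Handle;

# Hypothetical stand-in for an LWP fetch, so the sketch runs offline.
sub fetch_page {
    my ($url) = @_;
    return "<html><body>content of $url</body></html>\n";
}

# Fork one child per URL; each child sends one locked, framed message back.
sub fetch_all {
    my (@urls) = @_;

    pipe(my $reader, my $writer) or die "pipe failed: $!";
    $writer->autoflush(1);

    # Lock file the children use to take turns writing to the shared pipe.
    my ($lock_fh, $lock_path) = tempfile();
    close $lock_fh;

    my @pids;
    for my $url (@urls) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                       # child
            close $reader;
            my $html = fetch_page($url);
            # Each child opens the lock file itself: a handle inherited
            # from the parent would share one lock among all siblings.
            open(my $lock, '>>', $lock_path) or die "open lock: $!";
            flock($lock, LOCK_EX) or die "flock: $!";
            # Frame: "URL LENGTH\n" followed by exactly LENGTH bytes.
            print {$writer} $url, ' ', length($html), "\n", $html;
            flock($lock, LOCK_UN);
            exit 0;                            # job done, child dies
        }
        push @pids, $pid;
    }

    close $writer;    # parent must close its write end or it never sees EOF

    my %reply;
    while (defined(my $header = <$reader>)) {
        chomp $header;
        my ($url, $len) = $header =~ /^(\S+) (\d+)$/
            or die "bad header: $header";
        my $body = '';
        my $got = read($reader, $body, $len);
        die "short read for $url" unless defined $got && $got == $len;
        $reply{$url} = $body;
    }
    waitpid($_, 0) for @pids;
    unlink $lock_path;
    return \%reply;
}

my $replies = fetch_all(map { "http://example.com/page$_" } 1 .. 5);
printf "%s => %d bytes\n", $_, length $replies->{$_}
    for sort keys %$replies;
```

The parent collects replies in whatever order the children finish, which is why each message carries its URL. Anything a child changes in its own variables stays in the child, so results have to come back over the pipe like this.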

Re^2: How to speed up my Html parsing program? (Concurrently run Subroutines?)
by BobFishel (Acolyte) on Jan 06, 2009 at 16:49 UTC
    Good stuff here, thanks for the reply!!

    A couple of questions, and my apologies if these are simple/dumb, but I have 0 experience with fork; to date most of my programs have been very simple.

    1. Where in your pseudo code would I be forking a process?

    2. Does the queue maintain a list of requests to be passed to a forked process once the process completes?

    3. In a forked process do I need to be careful with passing by reference? Currently I'm passing a few variables (the LWP downloader and a Spreadsheet::WriteExcel object for output) by reference so as to reduce overhead.

    4. Also, do I need to be careful about multiple write calls on my workbook? Currently I am doing the outputting after processing each page with information.

    Thanks for your help so far!