dannoura has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm using LWP::Simple to download pages from this site. Since I'm downloading 40,000 pages, I'd like to know if there is any way to speed up the process; at the moment it's very slow. The code I'm using is:

#!c:\perl\bin\perl.exe

use WWW::Search;

$query  = "prostate cancer";
$search = new WWW::Search('PubMed');
$search->native_query(WWW::Search::escape_query($query));
$search->maximum_to_retrieve(40000);

while (my $result = $search->next_result()) {
}

With the command:

query.pl > pc_query.txt

(I'm hoping that dumping the output straight into a text file speeds things up a little.)

Does anyone have any ideas about how to speed this process up?

Replies are listed 'Best First'.
Re: speed up download for LWP::Simple
by tilly (Archbishop) on Jul 09, 2003 at 04:04 UTC
    Why are you ignoring their specific request and having a robot engage in exactly the kind of bulk download that they don't want you to do?

    It is called netiquette. If you have the knowledge to write the robot, you should also know when not to. And if you choose to ignore that, expect them to do things like block access from your IP address in self-defence. If you are doing this for an employer, please do some research on what robots.txt files are for, and then tell your employer that there is a real risk of being banned from accessing PubMed. Should you really continue?

    Update: I don't mean to imply that you are intentionally breaking the rules. Usually people just never realize that what they are doing is covered by robots.txt, which is why it is important to be proactive when the issue arises.

      Along these lines, you might find that using WWW::Robot decreases your runtime by returning only those pages that the site wishes to have spidered.
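
      As a rough illustration of checking robots.txt before each fetch, here is a minimal sketch using WWW::RobotRules (which ships with libwww-perl) rather than WWW::Robot itself; the URLs and the @urls_to_fetch list are placeholders:

          use LWP::Simple qw(get);
          use WWW::RobotRules;

          # Identify the robot by name; the rules object remembers what it has parsed.
          my $rules = WWW::RobotRules->new('pubmed-fetcher/0.1');

          # Fetch and parse the site's robots.txt once up front.
          my $robots_url = 'http://www.example.org/robots.txt';   # placeholder host
          my $robots_txt = get($robots_url);
          $rules->parse($robots_url, $robots_txt) if defined $robots_txt;

          # Only fetch pages the site allows to be spidered.
          for my $url (@urls_to_fetch) {                          # hypothetical list
              next unless $rules->allowed($url);
              my $page = get($url);
              # ... process $page ...
          }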

      I read their robots.txt and, although it forbids what I'm doing now, this is being done in coordination with one of their directors, so it's OK.

        In that case, arranging for access to the local files the website is backed by would be much faster, and would avoid undue load on the publicly used webservers.
Re: speed up download for LWP::Simple
by PodMaster (Abbot) on Jul 09, 2003 at 07:07 UTC
    Where are you using LWP::Simple? WWW::Search does not use LWP::Simple. The alternatives to LWP are HTTP::GHTTP and HTTP::Lite; HTTP::GHTTP purports to be the fastest of them all.
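
    For illustration, a bare-bones fetch with HTTP::Lite looks something like this (the URL is a placeholder); HTTP::GHTTP follows much the same pattern but needs the libghttp C library:

        use HTTP::Lite;

        my $http = HTTP::Lite->new;

        # request() returns the HTTP status code, or undef on failure.
        my $status = $http->request('http://www.example.org/some/page')
            or die "Unable to fetch document: $!";

        print $http->body() if $status == 200;   # raw response body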

    The other thing you could do is make parallel requests, perhaps with something like Parallel::ForkManager (see this and look around here for other examples of this type of technique).
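
    A rough sketch of that parallel approach, using Parallel::ForkManager with LWP::Simple's getstore (the URL list and output file names here are made up):

        use LWP::Simple qw(getstore);
        use Parallel::ForkManager;

        my @urls = @urls_to_fetch;                 # hypothetical list of page URLs
        my $pm   = Parallel::ForkManager->new(10); # at most 10 child processes

        for my $i (0 .. $#urls) {
            $pm->start and next;                   # parent forks and moves on
            getstore($urls[$i], "page_$i.html");   # child downloads one page
            $pm->finish;                           # child exits
        }
        $pm->wait_all_children;

    Keep the number of children modest so you don't hammer the server, for the reasons tilly gives above.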

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.