Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to create a web crawler and I was interested in what information you monks have on creating them. I know there are plenty of ways these could be misused, but in the end it'll be set up so it scrapes one page per period of time to go easy on the server.

What I'm trying to do is put in a $url and have it scan that page for links. Then, if it finds any, it'll follow the first link and keep branching out until every page it has access to, and links to, has been scanned.

I could probably manage using LWP::Simple or maybe LWP::UserAgent to scrape the main page for the links, and possibly do the same for the first link after that. But when it comes to branching out to get all the links from one page (like a tree of links), I have NO idea where to begin.

And using regexes would probably be a pain because webmasters don't always use FULL URLS like they should. Then you have to handle dynamic URLs like /cgi-bin/script.cgi?param=12&name=test.

Are there modules to do this type of thing I could use? Any advice would be much appreciated.

Re: Creating a web crawler (theory)
by brian_d_foy (Abbot) on Jan 28, 2005 at 17:30 UTC

    To grab pages, you already know about LWP::UserAgent. To extract links, you can use HTML::SimpleLinkExtor, which can use a base URL to turn relative URLs into absolute ones, or, if you need something fancier, you can write your own subclass of HTML::Parser.

    If you already have the URLs and you want to turn relative URLs into absolute ones, URI can do that for you.
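
    A minimal sketch of that extract-and-absolutize step, assuming the page has already been fetched with LWP::UserAgent (the URL below is just a placeholder):

        use LWP::UserAgent;
        use HTML::SimpleLinkExtor;

        my $url  = 'http://www.example.com/';
        my $ua   = LWP::UserAgent->new;
        my $html = $ua->get($url)->decoded_content;

        # Passing the base URL makes relative links come back absolute
        my $extor = HTML::SimpleLinkExtor->new($url);
        $extor->parse($html);

        my @links = $extor->a;    # just the <a href="..."> links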

    You can look at my personal web snarfer, webreaper, which has code for a lot of the things you need to do. Steal what you need.

    --
    brian d foy <bdfoy@cpan.org>
Re: Creating a web crawler (theory)
by hardburn (Abbot) on Jan 28, 2005 at 17:42 UTC

    It's pretty easy to do recursively:

    sub spider {
        my $page = shift;
        return if page_already_spidered( $page );
        my $mech = get_WWW_Mechanize( $page );
        spider( $_ ) for $mech->links;
        # Perform scraping of page
        return;
    }

    LWP::UserAgent with HTML::LinkExtractor works here, but WWW::Mechanize combines both of those for you.

    Update: For politeness, you could add a sleep 1 at the top of the spider() subroutine. That should keep the load on the server down.
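
    A slightly fuller sketch of the same idea, with WWW::Mechanize doing both the fetching and the link extraction and the sleep folded in (the start URL is a placeholder, and the hypothetical helpers above are replaced with a simple %seen hash):

        use strict;
        use warnings;
        use WWW::Mechanize;

        my %seen;
        my $mech = WWW::Mechanize->new( autocheck => 0 );

        sub spider {
            my $url = shift;
            return if $seen{$url}++;    # don't revisit pages
            sleep 1;                    # politeness delay
            $mech->get($url);
            return unless $mech->success and $mech->is_html;

            # ... scrape $mech->content here ...

            # Collect the links before recursing, since $mech is reused;
            # in practice you'd also want to skip links to other hosts
            my @links = map { $_->url_abs->as_string } $mech->links;
            spider($_) for @links;
        }

        spider('http://www.example.com/');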

    "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

Re: Creating a web crawler (theory)
by Fletch (Bishop) on Jan 28, 2005 at 18:07 UTC
    And using regexes would probably be a pain because webmasters don't always use FULL URLS like they should.

    Erm, no, relative URLs are perfectly valid. Do you really think it'd be a good idea to have a hyooman explicitly add http://www.wherever.com/six/levels/deep/into/some/path/ to the front of every URI? Not every page is automatically generated.

    At any rate, see the new_abs method from URI for how to handle these easily.
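
    For example (the relative path and base URL here are made up):

        use URI;

        # Resolve a relative link against the page it appeared on
        my $abs = URI->new_abs( '../script.cgi?param=12&name=test',
                                'http://www.wherever.com/six/levels/deep/' );
        print $abs, "\n";    # http://www.wherever.com/six/levels/script.cgi?param=12&name=test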

Re: Creating a web crawler (theory)
by Zaxo (Archbishop) on Jan 28, 2005 at 17:37 UTC

    See Parallel::ForkManager for a convenient way to spawn off new processes and to limit the number of active ones.

    You probably want to extract the host or domain of a link for your commendable desire to rate-limit your requests. URI.pm can do that for you. It's distributed with perl, I believe.

    If you find external links in a document, a (perhaps partial) breadth-first traversal strategy will give you something to do while waiting to hit the current domain again. Don't forget about robots.txt; you ought to plan to honor it.

    A CPAN search for Robot or Parallel turns up LWP::Parallel::RobotUA and several other candidates to help with this.
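
    For the forking side, a rough sketch with Parallel::ForkManager (the URL list and the limit of five workers are arbitrary, and the actual fetching is left as a comment):

        use strict;
        use warnings;
        use Parallel::ForkManager;
        use URI;

        my @urls = ( 'http://www.example.com/', 'http://www.example.org/' );
        my $pm   = Parallel::ForkManager->new(5);    # at most 5 children at once

        for my $url (@urls) {
            $pm->start and next;    # parent: move on to the next URL

            # Child process: fetch and parse $url here, rate-limiting
            # per host -- URI makes pulling out the host easy
            my $host = URI->new($url)->host;

            $pm->finish;            # child exits
        }
        $pm->wait_all_children;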

    After Compline,
    Zaxo

Re: Creating a web crawler (theory)
by gaal (Parson) on Jan 29, 2005 at 14:14 UTC
    Don't forget to save URLs you've visited in a lookup table (e.g., a hash). Don't visit a URL you've already been to, at least not if you were there recently.

    This is simple, but so is the rage of a sysadmin whose site is being crawled in a loop. :)

    Use a User-Agent: header that allows admins to contact you should they need to.

    Oh, and honor robots.txt, yes?
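
    Something along these lines, assuming LWP::UserAgent and WWW::RobotRules (the agent name, contact address, and URLs below are placeholders):

        use strict;
        use warnings;
        use LWP::UserAgent;
        use WWW::RobotRules;

        # Identify yourself so an admin can contact you if needed
        my $ua = LWP::UserAgent->new;
        $ua->agent('MyCrawler/0.1 (+http://www.example.com/about-crawler.html)');
        $ua->from('crawler-admin@example.com');

        # Fetch and honor robots.txt before crawling a host
        my $rules      = WWW::RobotRules->new( $ua->agent );
        my $robots_url = 'http://www.example.com/robots.txt';
        my $res        = $ua->get($robots_url);
        $rules->parse( $robots_url, $res->content ) if $res->is_success;

        my %seen;    # URLs already visited
        my @queue = ('http://www.example.com/');
        while ( my $url = shift @queue ) {
            next if $seen{$url}++;
            next unless $rules->allowed($url);
            # ... fetch $url, scrape it, push new links onto @queue ...
        }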

Re: Creating a web crawler (theory)
by ww (Archbishop) on Jan 28, 2005 at 20:10 UTC
    AM wrote: because webmasters don't always use FULL URLS like they should.

    Certainly, there are reasons for using full URLs occasionally BUT WHERE DID YOU GET THAT IDEA? (That's not purely sarcasm. If you can offer an authority for that, I'd like to read it!)

    IIRC, a full URL forces the visitor's browser to revisit the DNS server, creating needless traffic and slowing rendering. (See brian_d_foy's reply below re DNS revisits: he's right and I clearly IDidNotRC... but I believe the balance of this post can stand!)

    However, you have a number of good answers on how to deal with your generic question, and good suggestions for dealing with relative links.

    But you may want to consider the volume of data you're apt to deal with. One of my sites has ~1600 pages and well over 5000 links. I can collect those links with a script -- ON A LOCAL MIRROR (i.e., no net time and no competition for the server's attention) -- in about 15 seconds, but I can't even guess what the time required would be if one were to try to chase down all the links on the secondary, tertiary, etc., pages...

      A full URL forces the browser to revisit DNS? Where did you get that idea? Even if you have some wacky set-up where you aren't caching replies, it doesn't affect rendering. As for needless traffic, a DNS query isn't much compared to all those images we ask our browser to download.

      Relative URLs are a convenience for our typing. To follow a link, the browser still needs to make it an absolute URL, then go where that URL says. A relative URL in an HTML page is not a secret signal to the browser to use some sort of quick fetching algorithm.

      You might be thinking about the difference between external and internal redirections. An external redirection is a full HTTP response that causes the user-agent to fetch the resource from a different URL. An internal redirection can be caught by the web server and handled without another request from the user-agent. Neither of these has anything to do with HTML, though.

      --
      brian d foy <bdfoy@cpan.org>
        Relative URLs do better than save typing. They save retyping. If you move a project inside a site or just rename it, with relative paths you don't have to hunt down all the links and change them.