Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to create a web crawler and I was interested in what information you monks have on creating them. I know there are plenty of ways these could be misused, but in the end it'll be set up so it scrapes one page per period of time to go easy on the server.

What I'm trying to do is put in a $url and have it scan that page for links. Then, if it finds any, it'll follow the first link and keep branching out until every page it has access to, and links to, has been scanned.

I could probably manage using LWP::Simple or maybe LWP::UserAgent to scrape the main page for the links, and possibly do the same for the first link after that. But when it comes to branching out to get all the links from one page (like a tree of links), I have NO idea where to begin.

And using regexes would probably be a pain because webmasters don't always use FULL URLS like they should. Then you have to handle dynamic URLs like /cgi-bin/script.cgi?param=12&name=test.

Are there modules to do this type of thing I could use? Any advice would be much appreciated.

Re: Creating a web crawler (theory)
by brian_d_foy (Abbot) on Jan 28, 2005 at 17:30 UTC

    To grab pages, you already know about LWP::UserAgent. To extract links, you can use HTML::SimpleLinkExtor, which can use a base URL to turn relative URLs into absolute ones, or, if you need something fancier, you can write your own subclass of HTML::Parser.

    If you already have the URLs and you want to turn relative URLs into absolute ones, URI can do that for you.
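
    A minimal sketch of that extract-and-absolutize step, assuming the page has already been fetched with LWP::UserAgent (the URL below is just a placeholder):

        use LWP::UserAgent;
        use HTML::SimpleLinkExtor;

        my $url  = 'http://www.example.com/';
        my $ua   = LWP::UserAgent->new;
        my $html = $ua->get($url)->decoded_content;

        # Passing the base URL makes relative links come back absolute
        my $extor = HTML::SimpleLinkExtor->new($url);
        $extor->parse($html);

        my @links = $extor->a;    # just the <a href="..."> links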

    You can look at my personal web snarfer, webreaper, which has code for a lot of the things you need to do. Steal what you need.

    --
    brian d foy <bdfoy@cpan.org>
Re: Creating a web crawler (theory)
by hardburn (Abbot) on Jan 28, 2005 at 17:42 UTC

    It's pretty easy to do recursively:

    sub spider {
        my $page = shift;
        return if page_already_spidered( $page );
        my $mech = get_WWW_Mechanize( $page );
        spider( $_ ) for $mech->links;
        # Perform scraping of page
        return;
    }

    LWP::UserAgent with HTML::LinkExtractor works here, but WWW::Mechanize combines both of those for you.

    Update: For politeness, you could add a sleep 1 at the top of the spider() subroutine. That should keep the load on the server down.
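
    A slightly fuller sketch of the same idea, with WWW::Mechanize doing both the fetching and the link extraction and the sleep folded in (the start URL is a placeholder, and the hypothetical helpers above are replaced with a simple %seen hash):

        use strict;
        use warnings;
        use WWW::Mechanize;

        my %seen;
        my $mech = WWW::Mechanize->new( autocheck => 0 );

        sub spider {
            my $url = shift;
            return if $seen{$url}++;    # don't revisit pages
            sleep 1;                    # politeness delay
            $mech->get($url);
            return unless $mech->success and $mech->is_html;

            # ... scrape $mech->content here ...

            # Collect the links before recursing, since $mech is reused;
            # in practice you'd also want to skip links to other hosts
            my @links = map { $_->url_abs->as_string } $mech->links;
            spider($_) for @links;
        }

        spider('http://www.example.com/');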

    "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

Re: Creating a web crawler (theory)
by Fletch (Bishop) on Jan 28, 2005 at 18:07 UTC
    And using regexes would probably be a pain because webmasters don't always use FULL URLS like they should.

    Erm, no, relative URLs are perfectly valid. Do you really think it'd be a good idea to have a hyooman explicitly add http://www.wherever.com/six/levels/deep/into/some/path/ to the front of every URI? Not every page is automatically generated.

    At any rate, see the new_abs method from URI for how to handle these easily.
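
    For example (the relative path and base URL here are made up):

        use URI;

        # Resolve a relative link against the page it appeared on
        my $abs = URI->new_abs( '../script.cgi?param=12&name=test',
                                'http://www.wherever.com/six/levels/deep/' );
        print $abs, "\n";    # http://www.wherever.com/six/levels/script.cgi?param=12&name=test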

Re: Creating a web crawler (theory)
by Zaxo (Archbishop) on Jan 28, 2005 at 17:37 UTC

    See Parallel::ForkManager for a convenient way to spawn off new processes and to limit the number of active ones.

    You probably want to extract the host or domain of a link for your commendable desire to rate-limit your requests. URI.pm can do that for you. It's distributed with perl, I believe.

    If you find external links in a document, a (perhaps partial) breadth-first traversal strategy will give you something to do while waiting to hit the current domain again. Don't forget about robots.txt; you ought to plan to honor it.

    A CPAN search for Robot or Parallel turns up LWP::Parallel::RobotUA and several other candidates to help with this.
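
    For the forking side, a rough sketch with Parallel::ForkManager (the URL list and the limit of five workers are arbitrary, and the actual fetching is left as a comment):

        use strict;
        use warnings;
        use Parallel::ForkManager;
        use URI;

        my @urls = ( 'http://www.example.com/', 'http://www.example.org/' );
        my $pm   = Parallel::ForkManager->new(5);    # at most 5 children at once

        for my $url (@urls) {
            $pm->start and next;    # parent: move on to the next URL

            # Child process: fetch and parse $url here, rate-limiting
            # per host -- URI makes pulling out the host easy
            my $host = URI->new($url)->host;

            $pm->finish;            # child exits
        }
        $pm->wait_all_children;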

    After Compline,
    Zaxo

Re: Creating a web crawler (theory)
by gaal (Parson) on Jan 29, 2005 at 14:14 UTC
    Don't forget to save URLs you've visited in a lookup table (e.g., a hash). Don't visit a URL you've already been to, at least not if you were there recently.

    This is simple, but so is the rage of a sysadmin whose site is being crawled in a loop. :)

    Use a User-Agent: header that allows admins to contact you should they need to.

    Oh, and honor robots.txt, yes?
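
    Something along these lines, assuming LWP::UserAgent and WWW::RobotRules (the agent name, contact address, and URLs below are placeholders):

        use strict;
        use warnings;
        use LWP::UserAgent;
        use WWW::RobotRules;

        # Identify yourself so an admin can contact you if needed
        my $ua = LWP::UserAgent->new;
        $ua->agent('MyCrawler/0.1 (+http://www.example.com/about-crawler.html)');
        $ua->from('crawler-admin@example.com');

        # Fetch and honor robots.txt before crawling a host
        my $rules      = WWW::RobotRules->new( $ua->agent );
        my $robots_url = 'http://www.example.com/robots.txt';
        my $res        = $ua->get($robots_url);
        $rules->parse( $robots_url, $res->content ) if $res->is_success;

        my %seen;    # URLs already visited
        my @queue = ('http://www.example.com/');
        while ( my $url = shift @queue ) {
            next if $seen{$url}++;
            next unless $rules->allowed($url);
            # ... fetch $url, scrape it, push new links onto @queue ...
        }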

Re: Creating a web crawler (theory)
by ww (Archbishop) on Jan 28, 2005 at 20:10 UTC
    AM wrote: because webmasters don't always use FULL URLS like they should.

    Certainly, there are reasons for using full URLs occasionally BUT WHERE DID YOU GET THAT IDEA? (That's not purely sarcasm. If you can offer an authority for that, I'd like to read it!)

    IIRC, a full URL forces the visitor's browser to revisit the DNS server, creating needless traffic and slowing rendering. (See brian_d_foy's reply below re DNS revisits: he's right and I clearly IDidNotRC... but I believe the balance of this post can stand!)

    However, you have a number of good answers on how to deal with your generic question, and good suggestions for dealing with relative links.

    But you may want to consider the volume of data you're apt to deal with. One of my sites has ~1600 pages and well over 5000 links. I can collect those links with a script -- ON A LOCAL MIRROR (i.e., no net time and no competition for the server's attention) -- in about 15 seconds, but I can't even guess what the time required would be if one were to try to chase down all the links on the secondary, tertiary, etc., pages...

      A full URL forces the browser to revisit DNS? Where did you get that idea? Even if you have some wacky set-up where you aren't caching replies, it doesn't affect rendering. As for needless traffic, a DNS query isn't much compared to all those images we ask our browser to download.

      Relative URLs are a convenience for our typing. To follow a link, the browser still needs to make it an absolute URL, then go where that URL says. A relative URL in an HTML page is not a secret signal to the browser to use some sort of quick fetching algorithm.

      You might be thinking about the difference between external and internal redirections. An external redirection is a full HTTP response that causes the user-agent to fetch the resource from a different URL. An internal redirection can be caught by the web server and handled without another request from the user-agent. Neither of these has anything to do with HTML, though.

      --
      brian d foy <bdfoy@cpan.org>
        Relative URLs do better than save typing. They save retyping. If you move a project inside a site or just rename it, with relative paths you don't have to hunt down all the links and change them.