Site Crawler

Frisbeeman has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to make a really simple search engine for a site that looks up keywords in a dbm database. In order to populate the database, I am trying to write a crawler that will read in the meta-tag keywords and then crawl to all linked pages. I have a really simple test written that connects to a page and reads its content. I found a useful tutorial here. Here's what I have so far:

#!/usr/local/bin/perl

# Tests for site crawler / db creator

use strict;
use warnings;
use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::LinkExtor;

my $browser = LWP::UserAgent->new();
$browser->timeout(10);    # give up after 10 seconds

my $URL = 'http://www.yahoo.com/';

my $request  = HTTP::Request->new(GET => $URL);
my $response = $browser->request($request);

# Send the CGI header before any other output, including error text.
print "Content-type: text/html\n\n";

print "<html>\n<head><title>Site Crawler</title></head>\n<body>";
if ($response->is_error()) {
    # Report the failure instead of printing an empty page.
    printf "<b>Request failed:</b> %s\n", $response->status_line;
}
else {
    my $contents = $response->content();
    print "<b>Here is the page's contents:</b><br>$contents";
}
print "</body>\n";
print "</html>";

This works great for yahoo.com or other generic sites. It shows yahoo.com with "Here is the page's contents:" at the top. However, when I try it on my site, it fails to connect. My web host told me that they know of the problem, but they don't know why it doesn't work. They told me to see if I could find a workaround. I tried it in PHP as well, same deal. Anyway, while I struggle with the web host to fix this issue, I'm trying to find another way to do it. What are other ways to connect to a URL and retrieve the data? Thanks for the help.
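
For the crawler described above (read the meta-tag keywords, then follow the links), a minimal sketch of the two parsing steps might look like the following. It assumes the HTML::HeadParser and HTML::LinkExtor modules are installed; the sub names `extract_keywords` and `extract_links`, the sample page, and the example.com URLs are made up for illustration. It parses a string rather than fetching anything, so it runs without a network connection.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::HeadParser;
use HTML::LinkExtor;

# Return the comma-separated meta keywords of a page as a list.
sub extract_keywords {
    my ($html) = @_;
    my $head = HTML::HeadParser->new;
    $head->parse($html);
    # HTML::HeadParser exposes <meta name="keywords"> as X-Meta-Keywords.
    my $raw = $head->header('X-Meta-Keywords');
    return () unless defined $raw;
    my @words = split /\s*,\s*/, $raw;
    s/^\s+|\s+$//g for @words;          # trim stray whitespace
    return grep { length } @words;
}

# Return the absolute URLs of all <a href="..."> links on a page.
sub extract_links {
    my ($html, $base) = @_;
    my $parser = HTML::LinkExtor->new(undef, $base);
    $parser->parse($html);
    $parser->eof;
    my @urls;
    for my $link ($parser->links) {
        my ($tag, %attr) = @$link;
        push @urls, "$attr{href}" if $tag eq 'a' && defined $attr{href};
    }
    return @urls;
}

# Hypothetical sample page, standing in for a fetched document.
my $html = <<'HTML';
<html><head><title>Demo</title>
<meta name="keywords" content="frisbee, ultimate , disc golf">
</head><body>
<a href="/rules.html">Rules</a>
<a href="http://www.example.com/teams/">Teams</a>
</body></html>
HTML

print "keyword: $_\n" for extract_keywords($html);
print "link: $_\n"    for extract_links($html, 'http://www.example.com/index.html');
```

A real crawler would feed `$response->content` into these subs, keep a hash of URLs already visited, and store the keywords in the dbm file; those parts are left out here.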


Replies are listed 'Best First'.
Re: Site Crawler by Elian (Parson) on Jun 13, 2002 at 21:35 UTC
I worked for a search engine (the late, lamented Northern Light) and you've just triggered a few pet peeves. First, if you're going to pay attention to meta tag keywords, you'd better be crawling only your own site. In general meta information is not just useless, it's actively deceptive. Trusting it is generally worse than a waste of time. You'll end up with a database full of lies. Sad, but true. Second, if you're crawling pages that aren't yours, you'd better obey the robot rules. Use LWP::RobotUA here, with a reasonable time limit. The default one-minute delay is fine, and dropping it as low as 10 or 20 seconds between requests is probably acceptable. (I'd leave it at the minute delay, personally.) If you're not going to use LWP::RobotUA, and are crawling other people's pages, then you'd darned well better make sure you space out the requests. (Your ISP may want you to do this for your own pages too: snagging a couple of thousand pages over a cable modem or other reasonably high bandwidth connection can be pretty harsh.) If you're going to do it, then do it right, be polite, and respect the "Keep off the Grass" signs.
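
A minimal sketch of the LWP::RobotUA setup recommended above, assuming the module is installed. The robot name and contact address are placeholders, not values from this thread. The script fetches only the URLs given on the command line, so running it with no arguments makes no requests.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::RobotUA;

# Hypothetical robot name and contact address; substitute your own.
my $ua = LWP::RobotUA->new(
    agent => 'site-crawler/0.1',
    from  => 'webmaster@example.com',
);

$ua->delay(1);      # minutes to wait between requests to the same host
$ua->timeout(10);   # seconds before giving up on a single request

# Fetch only the URLs named on the command line.
for my $url (@ARGV) {
    my $response = $ua->get($url);   # the host's robots.txt is consulted first
    printf "%s: %s\n", $url, $response->status_line;
}
```

The delay is expressed in minutes, so `$ua->delay(10/60)` would give the ten-second spacing mentioned above.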
Re: Re: Site Crawler by Frisbeeman (Initiate) on Jun 13, 2002 at 22:27 UTC
I am indeed just crawling my own site. I thought I made that pretty clear, but I guess I could have been clearer. I am aware of the issues you raised.
Re: Re: Re: Site Crawler by Elian (Parson) on Jun 13, 2002 at 23:00 UTC
Yup, it was unclear. (I'm presuming you've got a lot of dynamic content, otherwise crawling this way's kinda pointless)
•Re: Site Crawler by merlyn (Sage) on Jun 13, 2002 at 21:06 UTC
"I am trying to write a crawler that will read in the meta-tag keywords and then crawl to all linked pages." You mean like this one from my column? -- Randal L. Schwartz, Perl hacker
Re: •Re: Site Crawler by tjh (Curate) on Jun 13, 2002 at 22:30 UTC
I know you've been at it (Perl) for years, but I continue to be astonished at the range of topics you've addressed in articles, etc. Well done, and thanks. (Not to overlook the voluminous work of other Monks elsewhere, who likewise amaze me.)
Re: •Re: Site Crawler by Frisbeeman (Initiate) on Jun 13, 2002 at 22:37 UTC
I do mean one very much like the one from your column. I would like to write my own (good learning experience) and the article says that the text is copyrighted. Do you have an example running anywhere, so I could take a look at it running? Also, if I can just copy it, is there somewhere I can get the code without the line numbers? I suspect that this implementation will run into the same problems caused by my ISP, but I would like to try. Thanks.
•Re: Re: •Re: Site Crawler by merlyn (Sage) on Jun 13, 2002 at 22:46 UTC
I haven't yet put the link from that page to the listing page, but if you go back to the table of contents, you can find a listing link for that article. As for copyright, I can't officially grant you the ability to use the code, but I know that there's nobody around that will notice or care that you started with that code. {grin} -- Randal L. Schwartz, Perl hacker
Re: Site Crawler by dws (Chancellor) on Jun 13, 2002 at 20:36 UTC
"This works great for yahoo.com or other generic sites. ... However, when I try it on my site, it fails to connect. My web host told me that they know of the problem, but they don't know why it doesn't work." You don't mention whether this crawler is running on your desktop or from your ISP. If your script (or the networking layer beneath it) is having problems resolving the name of your host, where you're running makes a difference. If you are running on your desktop, and can bring your site up in a browser, you should be able to bring it up via LWP. If you are running the script on your ISP's box, and cannot access your site, there's probably a DNS problem at your ISP. If you have shell access there, try `% ping www.whateveryoursitenameis.com` If that fails, have a chat with your ISP again.
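
The name-resolution part of that check can also be run from Perl itself, which helps when the script and the shell might see different resolver settings. This is only a sketch using the core Socket module; the sub name `resolve_host` is made up here, and the host name comes from the command line, defaulting to `localhost`. It tests DNS lookup only, not whether the host answers.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Socket qw(inet_aton inet_ntoa);

# Return the dotted-quad IPv4 address for $host, or undef if the
# resolver on this machine cannot look the name up.
sub resolve_host {
    my ($host) = @_;
    my $packed = inet_aton($host);
    return defined $packed ? inet_ntoa($packed) : undef;
}

# Host name to check; pass your own site's name as the first argument.
my $host = shift(@ARGV) || 'localhost';
my $ip   = resolve_host($host);

print defined $ip
    ? "$host resolves to $ip\n"
    : "Cannot resolve $host\n";
```

If this reports that the site's own name cannot be resolved while other names can, the problem sits in the ISP's DNS setup rather than in LWP.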
Re: Re: Site Crawler by Frisbeeman (Initiate) on Jun 13, 2002 at 20:47 UTC
Thanks for the reply. I'm running it from the ISP, and do have shell access. I tried to ping our domain and it didn't work. I'll see what the ISP thinks about that.
Re: Site Crawler by Frisbeeman (Initiate) on Jun 13, 2002 at 20:58 UTC
Just to be more complete: I have tried `$contents = get($url);` instead of all of the LWP::UserAgent, HTTP::Request, etc. stuff. It didn't work either.
Re: Site Crawler by Frisbeeman (Initiate) on Jun 13, 2002 at 23:11 UTC
Well, I talked to my ISP, and they told me to try connecting through their proxy. I didn't push it too much, but it works now. I guess I'm back to PHP now, sorry guys.
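
For anyone who hits the same wall and wants to stay in Perl, routing LWP through a proxy might look roughly like this. The proxy address is a placeholder, since the thread never gives the ISP's real one, and the sub name `make_proxied_agent` is made up for illustration. The script fetches a page only when a URL is passed on the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Build a user agent that sends its HTTP requests through $proxy_url.
sub make_proxied_agent {
    my ($proxy_url) = @_;
    my $ua = LWP::UserAgent->new(timeout => 10);
    $ua->proxy(['http'], $proxy_url);
    return $ua;
}

# Hypothetical proxy address; ask the ISP for the real host and port.
my $ua = make_proxied_agent('http://proxy.example.com:8080/');
print "HTTP requests go via ", $ua->proxy('http'), "\n";

# Fetch a page only when a URL is given as the first argument.
if (my $url = shift @ARGV) {
    my $response = $ua->get($url);
    print $response->is_success
        ? $response->content
        : "Request failed: " . $response->status_line . "\n";
}
```

LWP::UserAgent also has an `env_proxy` method that picks the proxy up from environment variables such as `http_proxy`, which avoids hard-coding the address in the script.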