in reply to Which modules needed?

It sounds like you're "linting" incoming HTML. I've done this before. You'll want to use LWP::UserAgent with HTTP::Request (or LWP::Simple for quick one-offs) to HEAD or GET the raw content, like so:
```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $url  = "http://www.foo.bar/blort/quux.html";
my $req  = HTTP::Request->new(HEAD => $url);
my $ua   = LWP::UserAgent->new;
my $resp = $ua->request($req);

my $type        = $resp->header('Content-Type');
my $status_line = $resp->status_line;
```
Note that I'm using LWP::UserAgent in there, and that I'm checking the status of the HEAD request to make sure it's a 200. If it's anything but a 200, you have to react accordingly (e.g. 404 is a bad URL, 500 is a server error, and so on).
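To make the "react accordingly" part concrete, here's a small sketch of branching on the response code. It builds HTTP::Response objects by hand (purely for illustration, so it runs without a live server); in real use $resp would come back from $ua->request($req) as above.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hand-built responses stand in for real ones, so this runs offline.
for my $code (200, 404, 500) {
    my $resp = HTTP::Response->new($code);

    if ($resp->is_success) {    # any 2xx status
        print $resp->status_line, " - safe to go back and GET the content\n";
    }
    elsif ($resp->code == 404) {
        print $resp->status_line, " - bad URL, flag it\n";
    }
    else {
        print $resp->status_line, " - server-side trouble, maybe retry later\n";
    }
}
```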

Replace HEAD with GET to pull the raw HTML page itself. Ideally you'd test HEAD on the page first, before pulling the content, but that depends on your design and on whether you are pulling lots of pages (e.g. a web spider) or one page at a time (on user request).
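Putting those two steps together, here's a sketch of the HEAD-then-GET pattern (same placeholder URL as above; $ua->head and $ua->get are LWP::UserAgent's shortcuts for building the requests yourself):

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $url = "http://www.foo.bar/blort/quux.html";
my $ua  = LWP::UserAgent->new;

# Cheap existence check first ...
my $head = $ua->head($url);
die "HEAD failed: ", $head->status_line, "\n" unless $head->is_success;

# ... then pull the raw HTML itself
my $get = $ua->get($url);
die "GET failed: ", $get->status_line, "\n" unless $get->is_success;

my $html = $get->content;
print length($html), " bytes of HTML fetched\n";
```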

You'll also likely want to use URI::Escape to make sure spaces, @ signs, and other "foreign" characters are handled properly as given, so they don't get mis-parsed by your tools or shell. Used like so:

```perl
use strict;
use warnings;
use URI::Escape;

my $url     = "http://www.foo.bar/blort/quux.html";
my $safeurl = uri_escape($url);
my $newurl  = uri_unescape($safeurl);

print "URL.....: $url\n";
print "Safe URL: $safeurl\n";
print "New URL.: $newurl\n";
```

The other modules you may want to use are HTML::LinkExtor (used to extract the links), URI::URL (to play with URI objects), and HTTP::Request (to manipulate the request object).
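As a starting point for HTML::LinkExtor, here's a minimal sketch that pulls every link-ish attribute (href, src, and so on) out of a chunk of HTML; the markup is just a made-up sample:

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# A made-up snippet standing in for the raw HTML you GET'd earlier.
my $html = '<a href="/blort/quux.html">quux</a><img src="pix/foo.gif">';

my @links;
my $parser = HTML::LinkExtor->new(
    sub {
        my ($tag, %attrs) = @_;       # e.g. ('a', href => '/blort/quux.html')
        push @links, values %attrs;
    }
);
$parser->parse($html);
$parser->eof;

print "$_\n" for @links;   # prints the two extracted URLs
```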

I'll leave it up to you to find code examples that represent how to use those modules.

You may want to look at Ovid's CGI Course for some more ideas.