pc88mxer has asked for the wisdom of the Perl Monks concerning the following question:

Suppose I have downloaded a site's web pages into a directory. I am using WWW::Mechanize to recursively traverse the pages like this:
    my $m = WWW::Mechanize->new;
    my $ROOT = "/path/to/download/directory";

    sub visit {
        my $url = shift;
        $m->get($url);
        ...
        for my $link ($m->links) {
            visit($link->url_abs);   # problem is here
        }
    }
    visit("file:$ROOT/index.html");
Basically I want to make $ROOT the new root of the site.

There are two issues: site-relative links (those beginning with "/") resolve against the filesystem root instead of against $ROOT, and relative links containing ".." can climb out of $ROOT entirely.

Examples:

    <a href="/foo"></a>    -- $link->url_abs is "file:/foo"
    <a href="../bar"></a>  -- $link->url_abs is "file:/path/to/download/bar"
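Both failure modes can be reproduced directly with URI->new_abs, which is what WWW::Mechanize uses to absolutize links (the path names here are illustrative):

```perl
use strict;
use warnings;
use URI;

# Hypothetical base: the downloaded index page under $ROOT.
my $base = "file:/path/to/download/directory/index.html";

# Site-relative link: resolves against the filesystem root, not $ROOT.
print URI->new_abs("/foo", $base), "\n";     # file:/foo

# Relative link with "..": climbs out of the download directory.
print URI->new_abs("../bar", $base), "\n";   # file:/path/to/download/bar
```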
Is there a clean way to handle these problems?

I've thought about starting up a local web server to serve the pages, but it seems like a lot of overhead just to perform some url munging.
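For what it's worth, the overhead need not be large: HTTP::Daemon (from libwww-perl) can serve the tree in about twenty lines. This is a rough sketch, not the poster's code; the port, root path, and the map_path helper are my own choices, and the helper refuses any request that tries to ".." its way above the root:

```perl
use strict;
use warnings;

my $ROOT = "/path/to/download/directory";   # hypothetical root

# Map a request path to a file under $ROOT, refusing ".." escapes.
sub map_path {
    my ($path) = @_;
    my @parts;
    for my $seg (split m{/}, $path) {
        next if $seg eq '' || $seg eq '.';
        if ($seg eq '..') {
            return undef unless @parts;   # tried to climb past the root
            pop @parts;
            next;
        }
        push @parts, $seg;
    }
    return join('/', $ROOT, @parts);
}

# Run the server only when asked, so the helper can be used on its own.
if ($ENV{SERVE}) {
    require HTTP::Daemon;   # ships with libwww-perl
    my $d = HTTP::Daemon->new(LocalPort => 8080) or die "cannot listen: $!";
    print "serving $ROOT at ", $d->url, "\n";
    while (my $c = $d->accept) {
        while (my $r = $c->get_request) {
            my $file = map_path($r->uri->path);
            if (defined $file && -f $file) { $c->send_file_response($file) }
            else                           { $c->send_error(404) }
        }
        $c->close;
    }
}
```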

Replies are listed 'Best First'.
Re: resolving URLs in downloaded pages
by jethro (Monsignor) on Jun 12, 2008 at 02:04 UTC
    wget has an option to convert links in downloaded websites: --convert-links

    I don't know whether it covers both your issues, but a test can't hurt.

Re: resolving URLs in downloaded pages (hacking LWP::UserAgent)
by pc88mxer (Vicar) on Jun 12, 2008 at 17:59 UTC
    Here's a hack I've come up with that does what I'm looking for. The idea is to subclass WWW::Mechanize (which itself is subclassed from LWP::UserAgent) to silently rewrite URLs that have a special scheme. In my case, I've chosen bar: as the special scheme which gets re-written to file:$ROOT.
    package MyUserAgent;
    use base 'WWW::Mechanize';
    use URI;

    sub file_root {
        my $self = shift;
        if (@_) {
            $self->{_file_root} = shift;
        }
        $self->{_file_root};
    }

    sub send_request {
        my $self    = shift;
        my $request = shift;
        my $old_uri = $request->uri;
        if ($old_uri->scheme eq 'bar') {
            if ($old_uri->path =~ m{\A/\.\.(/|\z)}) {
                return LWP::UserAgent::_new_response($request, 404,
                    "File not found - URL begins with /..");
            }
            my $new_uri = URI->new("file:" . $self->file_root . "/" . $old_uri->path);
            $request->uri($new_uri);
        }
        my $ret = $self->SUPER::send_request($request, @_);
        $request->uri($old_uri);
        $ret;
    }
    And here's how it could be used:
    my $m = MyUserAgent->new();
    my $ROOT = "/var/www/ua-test";
    $m->file_root($ROOT);

    sub visit {
        my $url = shift;
        warn "visiting $url...\n";
        $m->get($url);
        if ($m->success) {
            print "successful for $url\n";
            for my $link ($m->links) {
                print "got link: " . $link->url_abs . "\n";
                visit($link->url);
            }
        } else {
            print "Not successful for $url\n";
        }
    }
    visit("bar:/index.html");
    One nice thing about this approach is that all site-relative URLs keep the scheme bar:.
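That is easy to check: URI treats bar: as a generic scheme, so new_abs resolves both site-relative and document-relative links against the bar: base rather than falling through to file: (the base page here is illustrative):

```perl
use strict;
use warnings;
use URI;

my $base = "bar:/sub/index.html";   # hypothetical base page

# Site-relative link stays under the bar: scheme.
print URI->new_abs("/foo", $base), "\n";      # bar:/foo

# Document-relative link resolves against the base directory.
print URI->new_abs("img.png", $base), "\n";   # bar:/sub/img.png
```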

    The constructor URI->new_abs() (called in $m->get()) does the work of collapsing occurrences of .. in the URL. If they occur at the beginning it just leaves them, and that makes it easy to tell if you've tried to updir your way past the root. E.g.:

    my $base = "ftp:/a/b";
    URI->new_abs("/c/d", $base)             -> "ftp:/c/d"
    URI->new_abs("e/f/../g", $base)         -> "ftp:/a/e/g"
    URI->new_abs("../g", $base)             -> "ftp:/g"
    URI->new_abs("../../i", $base)          -> "ftp:/../i"
    URI->new_abs("../../j/k/l/../m", $base) -> "ftp:/../j/k/m"
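So the guard in send_request reduces to testing the collapsed path for a leading /.. segment. A small helper (escapes_root is my own name, not part of the posted code) makes the check reusable:

```perl
use strict;
use warnings;
use URI;

# True when $link, resolved against $base, would escape the site root.
sub escapes_root {
    my ($link, $base) = @_;
    return URI->new_abs($link, $base)->path =~ m{\A/\.\.(/|\z)} ? 1 : 0;
}

my $base = "bar:/a/b";
print escapes_root("../g", $base),    "\n";   # 0 -- collapses to "/g", still inside
print escapes_root("../../i", $base), "\n";   # 1 -- collapses to "/../i", escapes
```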