pc88mxer has asked for the wisdom of the Perl Monks concerning the following question:

Suppose I have downloaded a site's web pages into a directory. I am using WWW::Mechanize to recursively traverse the pages like this:
    my $m = WWW::Mechanize->new;
    my $ROOT = "/path/to/download/directory";

    sub visit {
        my $url = shift;
        $m->get($url);
        ...
        for my $link ($m->links) {
            visit($link->url_abs);   # problem is here
        }
    }
    visit("file:$ROOT/index.html");
Basically I want to make $ROOT the new root of the site.

There are two issues: site-relative links (those beginning with "/") resolve against the filesystem root instead of against $ROOT, and relative links containing ".." can climb out of $ROOT entirely.

Examples:

    <a href="/foo"></a>    -- $link->url_abs is "file:/foo"
    <a href="../bar"></a>  -- $link->url_abs is "file:/path/to/download/bar"
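Both failure modes can be reproduced directly with URI->new_abs, which is what WWW::Mechanize uses to absolutize links (the path names here are illustrative):

```perl
use strict;
use warnings;
use URI;

# Hypothetical base: the downloaded index page under $ROOT.
my $base = "file:/path/to/download/directory/index.html";

# Site-relative link: resolves against the filesystem root, not $ROOT.
print URI->new_abs("/foo", $base), "\n";     # file:/foo

# Relative link with "..": climbs out of the download directory.
print URI->new_abs("../bar", $base), "\n";   # file:/path/to/download/bar
```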
Is there a clean way to handle these problems?

I've thought about starting up a local web server to serve the pages, but it seems like a lot of overhead just to perform some url munging.
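For what it's worth, the overhead need not be large: HTTP::Daemon (from libwww-perl) can serve the tree in about twenty lines. This is a rough sketch, not the poster's code; the port, root path, and the map_path helper are my own choices, and the helper refuses any request that tries to ".." its way above the root:

```perl
use strict;
use warnings;

my $ROOT = "/path/to/download/directory";   # hypothetical root

# Map a request path to a file under $ROOT, refusing ".." escapes.
sub map_path {
    my ($path) = @_;
    my @parts;
    for my $seg (split m{/}, $path) {
        next if $seg eq '' || $seg eq '.';
        if ($seg eq '..') {
            return undef unless @parts;   # tried to climb past the root
            pop @parts;
            next;
        }
        push @parts, $seg;
    }
    return join('/', $ROOT, @parts);
}

# Run the server only when asked, so the helper can be used on its own.
if ($ENV{SERVE}) {
    require HTTP::Daemon;   # ships with libwww-perl
    my $d = HTTP::Daemon->new(LocalPort => 8080) or die "cannot listen: $!";
    print "serving $ROOT at ", $d->url, "\n";
    while (my $c = $d->accept) {
        while (my $r = $c->get_request) {
            my $file = map_path($r->uri->path);
            if (defined $file && -f $file) { $c->send_file_response($file) }
            else                           { $c->send_error(404) }
        }
        $c->close;
    }
}
```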

Replies are listed 'Best First'.
Re: resolving URLs in downloaded pages
by jethro (Monsignor) on Jun 12, 2008 at 02:04 UTC
    wget has an option to convert links in downloaded websites: --convert-links

    I don't know whether it covers both your issues, but a test can't hurt.

Re: resolving URLs in downloaded pages (hacking LWP::UserAgent)
by pc88mxer (Vicar) on Jun 12, 2008 at 17:59 UTC
    Here's a hack I've come up with that does what I'm looking for. The idea is to subclass WWW::Mechanize (which itself is subclassed from LWP::UserAgent) to silently rewrite URLs that have a special scheme. In my case, I've chosen bar: as the special scheme which gets re-written to file:$ROOT.
    package MyUserAgent;
    use base 'WWW::Mechanize';
    use URI;

    sub file_root {
        my $self = shift;
        if (@_) {
            $self->{_file_root} = shift;
        }
        $self->{_file_root};
    }

    sub send_request {
        my $self    = shift;
        my $request = shift;
        my $old_uri = $request->uri;
        if ($old_uri->scheme eq 'bar') {
            if ($old_uri->path =~ m{\A/\.\.(/|\z)}) {
                return LWP::UserAgent::_new_response($request, 404,
                    "File not found - URL begins with /..");
            }
            my $new_uri = URI->new("file:" . $self->file_root . "/" . $old_uri->path);
            $request->uri($new_uri);
        }
        my $ret = $self->SUPER::send_request($request, @_);
        $request->uri($old_uri);
        $ret;
    }
    And here's how it could be used:
    my $m = MyUserAgent->new();
    my $ROOT = "/var/www/ua-test";
    $m->file_root($ROOT);

    sub visit {
        my $url = shift;
        warn "visiting $url...\n";
        $m->get($url);
        if ($m->success) {
            print "successful for $url\n";
            for my $link ($m->links) {
                print "got link: " . $link->url_abs . "\n";
                visit($link->url);
            }
        } else {
            print "Not successful for $url\n";
        }
    }
    visit("bar:/index.html");
    One nice thing about this approach is that all site-relative URLs keep the scheme bar:.
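That is easy to check: URI treats bar: as a generic scheme, so new_abs resolves both site-relative and document-relative links against the bar: base rather than falling through to file: (the base page here is illustrative):

```perl
use strict;
use warnings;
use URI;

my $base = "bar:/sub/index.html";   # hypothetical base page

# Site-relative link stays under the bar: scheme.
print URI->new_abs("/foo", $base), "\n";      # bar:/foo

# Document-relative link resolves against the base directory.
print URI->new_abs("img.png", $base), "\n";   # bar:/sub/img.png
```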

    The constructor URI->new_abs() (called in $m->get()) does the work of collapsing occurrences of .. in the URL. If they occur at the beginning it just leaves them, and that makes it easy to tell if you've tried to updir your way past the root. E.g.:

    my $base = "ftp:/a/b";
    URI->new_abs("/c/d", $base)             -> "ftp:/c/d"
    URI->new_abs("e/f/../g", $base)         -> "ftp:/a/e/g"
    URI->new_abs("../g", $base)             -> "ftp:/g"
    URI->new_abs("../../i", $base)          -> "ftp:/../i"
    URI->new_abs("../../j/k/l/../m", $base) -> "ftp:/../j/k/m"
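So the guard in send_request reduces to testing the collapsed path for a leading /.. segment. A small helper (escapes_root is my own name, not part of the posted code) makes the check reusable:

```perl
use strict;
use warnings;
use URI;

# True when $link, resolved against $base, would escape the site root.
sub escapes_root {
    my ($link, $base) = @_;
    return URI->new_abs($link, $base)->path =~ m{\A/\.\.(/|\z)} ? 1 : 0;
}

my $base = "bar:/a/b";
print escapes_root("../g", $base),    "\n";   # 0 -- collapses to "/g", still inside
print escapes_root("../../i", $base), "\n";   # 1 -- collapses to "/../i", escapes
```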