in reply to resolving URLs in downloaded pages

Here's a hack I've come up with that does what I'm looking for. The idea is to subclass WWW::Mechanize (which itself is subclassed from LWP::UserAgent) to silently rewrite URLs that have a special scheme. In my case, I've chosen bar: as the special scheme which gets re-written to file:$ROOT.
package MyUserAgent; use base 'WWW::Mechanize'; use URI; sub file_root { my $self = shift; if (@_) { $self->{_file_root} = shift; } $self->{_file_root}; } sub send_request { my $self = shift; my $request = shift; my $old_uri = $request->uri; if ($old_uri->scheme eq 'bar') { if ($old_uri->path =~ m{\A/..(/|\z)}) { return LWP::UserAgent::_new_response($request, 404, "File not found - URL begins with /.."); } my $new_uri = URI->new("file:".$self->file_root."/".$old_uri->path +); $request->uri($new_uri); } my $ret = $self->SUPER::send_request($request, @_); $request->uri($old_uri); $ret; }
And here's how it could be used:
my $m = MyUserAgent->new(); my $ROOT = "/var/www/ua-test"; $m->file_root($ROOT); sub visit { my $url = shift; warn "visiting $url...\n"; $m->get($url); if ($m->success) { print "successful for $url\n"; for my $link ($m->links) { print "got link: ".$link->url_abs."\n"; visit($link->url); } } else { print "Not successful for $url\n"; } } visit("bar:/index.html");
One nice thing about of this approach is that all the site-relative URLs will have the scheme bar:.

The constructor URI->new_abs()> (called in $m->get()) does the work of collapsing occurrences of .. in the url. If they occur at the beginning it just leaves them, and that makes it easy to tell if you've tried to updir your way past the root. E.g.:

my $base = "ftp:/a/b"; URI->new_abs("/c/d", $base) -> "ftp:/c/d" URI->new_abs("e/f/../g", $base) -> "ftp:/a/e/g" URI->new_abs("../g", $base) -> "ftp:/g" URI->new_abs("../../i", $base) -> "ftp:/../i" URI->new_abs("../../j/k/l/../m", $base) -> "ftp:/../j/k/m"