in reply to resolving URLs in downloaded pages
And here's how it could be used:package MyUserAgent; use base 'WWW::Mechanize'; use URI; sub file_root { my $self = shift; if (@_) { $self->{_file_root} = shift; } $self->{_file_root}; } sub send_request { my $self = shift; my $request = shift; my $old_uri = $request->uri; if ($old_uri->scheme eq 'bar') { if ($old_uri->path =~ m{\A/..(/|\z)}) { return LWP::UserAgent::_new_response($request, 404, "File not found - URL begins with /.."); } my $new_uri = URI->new("file:".$self->file_root."/".$old_uri->path +); $request->uri($new_uri); } my $ret = $self->SUPER::send_request($request, @_); $request->uri($old_uri); $ret; }
One nice thing about of this approach is that all the site-relative URLs will have the scheme bar:.my $m = MyUserAgent->new(); my $ROOT = "/var/www/ua-test"; $m->file_root($ROOT); sub visit { my $url = shift; warn "visiting $url...\n"; $m->get($url); if ($m->success) { print "successful for $url\n"; for my $link ($m->links) { print "got link: ".$link->url_abs."\n"; visit($link->url); } } else { print "Not successful for $url\n"; } } visit("bar:/index.html");
The constructor URI->new_abs()> (called in $m->get()) does the work of collapsing occurrences of .. in the url. If they occur at the beginning it just leaves them, and that makes it easy to tell if you've tried to updir your way past the root. E.g.:
my $base = "ftp:/a/b"; URI->new_abs("/c/d", $base) -> "ftp:/c/d" URI->new_abs("e/f/../g", $base) -> "ftp:/a/e/g" URI->new_abs("../g", $base) -> "ftp:/g" URI->new_abs("../../i", $base) -> "ftp:/../i" URI->new_abs("../../j/k/l/../m", $base) -> "ftp:/../j/k/m"
|
|---|