Here's a hack I've come up with that does what I'm looking for. The idea is to subclass WWW::Mechanize (which is itself a subclass of LWP::UserAgent) so that it silently rewrites URLs that have a special scheme. In my case, I've chosen bar: as the special scheme, which gets rewritten to file:$ROOT.
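    package MyUserAgent;

    use strict;
    use warnings;
    use base 'WWW::Mechanize';
    use URI;

    # Get or set the directory that bar: URLs are mapped onto.
    sub file_root {
        my $self = shift;
        if (@_) { $self->{_file_root} = shift; }
        return $self->{_file_root};
    }

    # Rewrite bar: requests to file:$ROOT/..., then restore the original URI
    # so the request still carries the bar: URL afterwards.
    sub send_request {
        my $self    = shift;
        my $request = shift;
        my $old_uri = $request->uri;
        if ($old_uri->scheme eq 'bar') {
            # A leading /.. means the link climbed above the root.
            # The dots are escaped so only a literal ".." segment matches.
            if ($old_uri->path =~ m{\A/\.\.(/|\z)}) {
                # _new_response is a private LWP::UserAgent helper.
                return LWP::UserAgent::_new_response($request, 404,
                    "File not found - URL begins with /..");
            }
            my $new_uri = URI->new("file:" . $self->file_root . "/" . $old_uri->path);
            $request->uri($new_uri);
        }
        my $ret = $self->SUPER::send_request($request, @_);
        $request->uri($old_uri);
        return $ret;
    }

    1;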
And here's how it could be used:
    use strict;
    use warnings;

    my $m    = MyUserAgent->new();
    my $ROOT = "/var/www/ua-test";
    $m->file_root($ROOT);

    my %seen;    # URLs already visited, so link cycles terminate

    sub visit {
        my $url = shift;
        return if $seen{$url}++;
        warn "visiting $url...\n";
        $m->get($url);
        if ($m->success) {
            print "successful for $url\n";
            for my $link ($m->links) {
                print "got link: " . $link->url_abs . "\n";
                # Follow the absolute URL: $m's base changes with every get().
                visit($link->url_abs);
            }
        }
        else {
            print "Not successful for $url\n";
        }
    }

    visit("bar:/index.html");
One nice thing about this approach is that all the site-relative URLs will have the scheme bar:.

The constructor URI->new_abs() (called in $m->get()) does the work of collapsing occurrences of .. in the URL. If they occur at the beginning of the path it just leaves them, and that makes it easy to tell if you've tried to updir your way past the root. E.g.:

    my $base = "ftp:/a/b";
    URI->new_abs("/c/d", $base)              -> "ftp:/c/d"
    URI->new_abs("e/f/../g", $base)          -> "ftp:/a/e/g"
    URI->new_abs("../g", $base)              -> "ftp:/g"
    URI->new_abs("../../i", $base)           -> "ftp:/../i"
    URI->new_abs("../../j/k/l/../m", $base)  -> "ftp:/../j/k/m"
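To make the "past the root" test concrete, here is a minimal sketch in core Perl only. It assumes the path has already been resolved (for instance by URI->new_abs() as above), so any remaining .. can only sit at the front. The helper names escapes_root and to_file_url are hypothetical and not part of the code above; they just isolate the check and the bar:-to-file: mapping.

```perl
use strict;
use warnings;

# Hypothetical helper: true when an already-resolved path still begins with
# a literal ".." segment, i.e. the link tried to climb above the site root.
sub escapes_root {
    my ($path) = @_;
    return $path =~ m{\A/\.\.(?:/|\z)} ? 1 : 0;
}

# Hypothetical helper: map a resolved site path onto a file: URL under $root.
# Returns undef (empty list in list context) when the path escapes the root.
sub to_file_url {
    my ($root, $path) = @_;
    return if escapes_root($path);
    $root =~ s{/+\z}{};    # drop trailing slashes from the root
    $path =~ s{\A/+}{};    # drop leading slashes from the path
    return "file:$root/$path";
}

# Try a few resolved paths against an assumed root directory.
for my $path ('/index.html', '/../i', '/..', '/..hidden/x') {
    my $url = to_file_url('/var/www/ua-test', $path);
    printf "%-14s => %s\n", $path, defined $url ? $url : 'rejected';
}
```

Note that the dots are escaped in the pattern. With an unescaped /.. the regex would also reject any path whose first segment is exactly two characters long, such as /ab/page.html.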

In reply to Re: resolving URLs in downloaded pages (hacking LWP::UserAgent) by pc88mxer
in thread resolving URLs in downloaded pages by pc88mxer
