PerlMonks  

WWW::Mechanize doesn't respect <base>?

by Arik123 (Beadle)
on Apr 26, 2021 at 08:38 UTC ( [id://11131723]=perlquestion )

Arik123 has asked for the wisdom of the Perl Monks concerning the following question:

I do something like

    my $mech = WWW::Mechanize->new;
    $mech->get("http://domain.com/page");
    print $mech->base;

It prints 'http://domain.com/page' although the page contains the line <base href="http://domain.com/">. As a result, all the (relative) links are broken:

    for ($mech->links) {
        print $_->url_abs, "\n";
    }

prints things like "http://domain.com/page/page2" instead of "http://domain.com/page2"

Any way to fix it, without the need to regex the page myself for the Base tag?

Thanks!

Replies are listed 'Best First'.
Re: WWW::Mechanize doesn't respect <base>?
by Corion (Patriarch) on Apr 26, 2021 at 09:09 UTC

    Somewhat related is this GitHub issue, but it seems that WWW::Mechanize tries to retrieve the value of base from the HTTP headers instead of (also, and with priority) looking at the HTML base tag.

    In vaguely related code, I've used the following to extract the value of the base tag:

        # Check if we have a <base> tag which should replace the user-supplied URL
        if( $_[0] =~ s!<\s*\bbase\b[^>]+\bhref=([^>]+)>!!i ) {
            # Extract the HREF:
            my $href= $1;
            if( $href =~ m!^(['"])(.*?)\1! ) {
                # href="..." , with quotes
                $href = $2;
            } elsif( $href =~ m!^([^>"' ]+)! ) {
                # href=... , without quotes
                $href = $1;
            } else {
                die "Should not get here, weirdo href= tag: [$href]"
            };
            my $old_url = $url;
            $url = relative_url( $url, $href );
            #warn "base: $old_url / $href => $url";
        };
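As an alternative to a hand-written regex, here is a minimal sketch that lets HTML::HeadParser find the <base> element and URI resolve links against it. The `$html` and `$page_url` values are made up for illustration (in a real script they would presumably come from `$mech->content` and `$mech->uri`), and this is not what WWW::Mechanize does internally.

```perl
use strict;
use warnings;
use HTML::HeadParser;
use URI;

# Hypothetical page content, standing in for $mech->content.
my $html = <<'HTML';
<html><head>
<title>Example</title>
<base href="http://domain.com/">
</head><body><a href="page2">next</a></body></html>
HTML

# Hypothetical URL the page was fetched from.
my $page_url = 'http://domain.com/page';

my $parser = HTML::HeadParser->new;
$parser->parse($html);

# HTML::HeadParser exposes <base href="..."> as the Content-Base header.
my $href = $parser->header('Content-Base');

# Fall back to the page URL when the document has no <base> element.
my $base = defined $href ? URI->new_abs($href, $page_url) : URI->new($page_url);

# Resolve a relative link against the effective base.
my $abs = URI->new_abs('page2', $base);

print "base: $base\n";
print "link: $abs\n";
```

With a base in hand, each `$link->url` from `$mech->links` could be resolved the same way instead of relying on `url_abs`.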
Re: WWW::Mechanize doesn't respect <base>?
by haj (Vicar) on Apr 26, 2021 at 10:24 UTC

    What's weird in that observation is that url_abs should give the correct link, even if your base element was ignored.

        use 5.020;
        use strict;
        use URI::URL;

        my $uri = URI::URL->new('page2', 'http://domain.com/page');
        say "base is: ", $uri->base;
        say "absolute link is: ", $uri->abs;

    This prints:

        base is: http://domain.com/page
        absolute link is: http://domain.com/page2

    As far as I can tell, WWW::Mechanize derives absolute URLs in quite the same way.

Re: WWW::Mechanize doesn't respect <base>? (patch)
by Anonymous Monk on Apr 28, 2021 at 01:39 UTC
    Here you go
    --- old\lib\WWW\Mechanize.pm 2021-04-27 18:52:56.406250000 -0700
    +++ new\lib\WWW\Mechanize.pm 2021-04-27 18:52:20.046875000 -0700
    @@ -1435,6 +1435,7 @@
         iframe => 'src',
         link   => 'href',
         meta   => 'content',
    +    base   => 'href',
     );
     
     sub _extract_links {
    @@ -1512,6 +1513,9 @@
     
         my $text;
         my $name;
    +    if( $tag eq 'base' ){
    +        $self->{base} = $token->[1]{href} ;
    +    }
         if ( $tag eq 'a' ) {
             $text = $parser->get_trimmed_text("/$tag");
             $text = '' unless defined $text;

      After many experiments, I can say this patch works perfectly. Thanks!

      BTW, immediately after the call to $mech->get(), $mech->base() still gives the wrong value. It gets fixed after a call to $mech->links(), which is OK for me.

Re: WWW::Mechanize doesn't respect <base>?
by Anonymous Monk on May 04, 2021 at 08:30 UTC

    You're correct, haj, that things should work even if 'base' is broken. So I did some more research, and here're my results:

        use WWW::Mechanize::Link;

        $u = WWW::Mechanize::Link->new ({url=>'./page2', base=>'http://domain.com/page/'});
        print $u->url_abs, "\n";

        $u = WWW::Mechanize::Link->new ({url=>'../page2', base=>'http://domain.com/page/'});
        print $u->url_abs, "\n";

        $u = WWW::Mechanize::Link->new ({url=>'./page2', base=>'http://domain.com/page'});
        print $u->url_abs, "\n";

        $u = WWW::Mechanize::Link->new ({url=>'../page2', base=>'http://domain.com/page'});
        print $u->url_abs, "\n";

    The output is:

        http://domain.com/page/page2
        http://domain.com/page2
        http://domain.com/page2
        http://domain.com/../page2

    As you can see, for links that start with ./ the base MUST NOT end with /, while for links that start with ../ the base MUST end with /. So, whether or not the <base> is honored, some links will be broken. Any cure?

      What you are showing here is just resolution of relative URLs. Section 5.2 of RFC 3986 has the gory details. The only difference is the fourth example, where the result should be http://domain.com/page2. Note that a trailing slash is significant in a URL, but whether there's more stuff after the rightmost slash is not, for the URL's purpose as a base URL.
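To make the RFC 3986 rules concrete, here is a simplified sketch of the merge step (section 5.2.3) and the remove_dot_segments step (section 5.2.4) in plain Perl. The function names are my own, and the resolver only handles a bare relative path (no scheme, authority, query or fragment, and no leading slash), so treat it as an illustration rather than a replacement for the URI module.

```perl
use strict;
use warnings;

# Remove "." and ".." segments from a path (RFC 3986, section 5.2.4).
sub remove_dot_segments {
    my ($in) = @_;
    my $out = '';
    while (length $in) {
        if ($in =~ s{^\.\.?/}{}) {
            # A: drop a leading "./" or "../"
        }
        elsif ($in =~ s{^/\.(?:/|\z)}{/}) {
            # B: "/./" or a final "/." collapses to "/"
        }
        elsif ($in =~ s{^/\.\.(?:/|\z)}{/}) {
            # C: "/../" or a final "/.." also removes the last output segment
            $out =~ s{/?[^/]*\z}{};
        }
        elsif ($in =~ /^\.\.?\z/) {
            # D: a lone "." or ".." disappears
            $in = '';
        }
        else {
            # E: move the first segment (with its leading "/") to the output
            $in =~ s{^(/?[^/]*)}{};
            $out .= $1;
        }
    }
    return $out;
}

# Resolve a bare relative path against an absolute base URL.
# Simplification: $ref must not carry a scheme, authority, query or
# fragment, and must not start with "/".
sub resolve_relative_path {
    my ($ref, $base) = @_;
    my ($prefix, $path) = $base =~ m{^([a-z][a-z0-9+.-]*://[^/?#]*)([^?#]*)}i
        or die "not an absolute URL: $base";
    # Merge (section 5.2.3): keep the base path up to its rightmost "/".
    $path = '/' if $path eq '';
    (my $dir = $path) =~ s{[^/]*\z}{};
    return $prefix . remove_dot_segments($dir . $ref);
}

for my $case (
    [ './page2',  'http://domain.com/page/' ],
    [ '../page2', 'http://domain.com/page/' ],
    [ './page2',  'http://domain.com/page'  ],
    [ '../page2', 'http://domain.com/page'  ],
) {
    my ($ref, $base) = @$case;
    print "$ref against $base => ", resolve_relative_path($ref, $base), "\n";
}
```

The first three cases agree with the output shown above. The fourth yields http://domain.com/page2, because step C discards a ".." that has no segment left to remove.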

      Could you please show what you are expecting?
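For the fourth case specifically, one possible cure is the URI module's documented `$URI::ABS_REMOTE_LEADING_DOTS` setting, which tells `abs` to ignore excess ".." segments. This is a sketch with the thread's example URLs. I am assuming that `url_abs` ends up going through URI's `abs`, so setting the variable around the `url_abs` calls should have the same effect, but I have not checked that against the Mechanize source.

```perl
use strict;
use warnings;
use URI;

my $base = 'http://domain.com/page';

# Default behaviour: an excess ".." is kept in the result.
my $default = URI->new_abs('../page2', $base);

# With the flag set, URI drops ".." segments that would climb above the root.
my $cleaned = do {
    local $URI::ABS_REMOTE_LEADING_DOTS = 1;
    URI->new_abs('../page2', $base);
};

print "default:   $default\n";
print "with flag: $cleaned\n";
```

Using `local` keeps the change scoped to the block, so other code that relies on the default behaviour is left alone.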
