jonjacobmoon has asked for the wisdom of the Perl Monks concerning the following question:

So, this appears to be my week for stumpers.

This one seems so obvious that there must be a solution, but I can't find it. Could there be a bug in URI?

The code below works on this URL only if I add a trailing slash. The problem is that this is a program run on URLs that may or may not have a slash. I have worked around it with some regexes that add the slash at the end if it is not there, but I am wondering why URI is not smart enough to figure this out on its own.

#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;
use HTML::Parser;
use URI;

my $starturl = shift || die "No url supplied\n"; #"http://www.strathaven.s-lanark.sch.uk/pages/ring.htm"; #
my $baseuri = URI->new($starturl);
my $cururi;
my $url;
my @urls ;
push @urls,$starturl;
my $agent = new LWP::UserAgent;
my $parser = HTML::Parser->new(api_version => 3,
                               start_h => [\&start ,"tagname, attr"]);
$agent->agent("Jonzilla/666");

while( $url = shift @urls) {
    my $request = new HTTP::Request 'GET' => $url;
    my $result = $agent->request($request);
    if ($result->is_success) {
        print "URL: $url\n";
        #print $result->as_string;
        $parser->parse($result->content);
    } else {
        print "Error: " . $result->status_line . " URL=$url, $baseuri\n";
    }
}

sub start {
    my($tag,$attr) = @_;
    if ($tag eq 'frame' ) {
        my $thisuri = URI->new($attr->{src});
        push @urls, $thisuri->abs($cururi);
    }
}


I admit it, I am Paco.

Re: Lack of Trailing Slash Confuses URI
by blokhead (Monsignor) on Sep 21, 2002 at 17:01 UTC
    In general, if you access a URI aimed at a directory but don't have the trailing slash, you get redirected to the URI with the trailing slash included. When I type in the users.pandora.be/dvt URL in my browser, it changes to /dvt/ with a trailing slash.

    Why does it do this? Because index.html in the /dvt/ directory may have relative links (it does in this example). If the page links to the relative URI "top.htm", but the browser thinks it is looking at /dvt, it treats "dvt" as a file in the root directory. Following the link then loads /top.htm, and not /dvt/top.htm as we would expect.
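
    To see the difference for yourself, here is a minimal sketch that resolves the same relative link against both forms of the base URL. It uses URI->new_abs from the URI module; the resolve() helper is a name made up for this illustration, not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Hypothetical helper (not in the original script): resolve a
# relative link against a base URL and return it as a string.
sub resolve {
    my ($link, $base) = @_;
    return URI->new_abs($link, $base)->as_string;
}

# The same directory, written without and with the trailing slash.
for my $base ('http://users.pandora.be/dvt', 'http://users.pandora.be/dvt/') {
    print "$base + top.htm => ", resolve('top.htm', $base), "\n";
}
# Without the slash, "dvt" is replaced like a file name: /top.htm
# With the slash, "dvt/" is kept as a directory:         /dvt/top.htm
```

    So URI is doing exactly what the relative-resolution rules ask for. It has no way to know that /dvt is a directory; only the server's redirect tells you that.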

    When I run your script, it complains in line 48 about lack of arguments to $thisuri->abs($cururi). You have never set $cururi in your code! I modified things a bit and came up with something that works on the URL you give:

    if ($result->is_success) {
        $cururi = $result->base->as_string;
        print "URL: $url ($cururi)\n";
    $result->base returns the base URL of the HTTP response. When you request /dvt, you get redirected to /dvt/, so the response base differs from the URL you asked for, and you must use it as the base URL for your relative URIs. If you set $cururi = $url; instead, you'll get 404 errors during your recursion, because it tries to access /top.htm and so on, not /dvt/top.htm. After modifying these lines, I get this output:
    $ perl lwp-paco.pl http://users.pandora.be/dvt
    URL: http://users.pandora.be/dvt (http://users.pandora.be/dvt/)
    URL: http://users.pandora.be/dvt/top.htm (http://users.pandora.be/dvt/top.htm)
    URL: http://users.pandora.be/dvt/top.htm (http://users.pandora.be/dvt/top.htm)
    URL: http://users.pandora.be/dvt/tree.htm (http://users.pandora.be/dvt/tree.htm)
    URL: http://users.pandora.be/dvt/start.htm (http://users.pandora.be/dvt/start.htm)
    Notice the first URL: line, which has a different request URL than response base URL (in parentheses).
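
    If you want to try this without hitting the network, the following is a rough sketch that builds a response object by hand and resolves a frame src against its base. The response here is simulated: it stands in for what LWP would hand back after following the redirect from /dvt to /dvt/. The frame_url() helper is a hypothetical name for this illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request;
use HTTP::Response;
use URI;

# Hypothetical helper: turn a frame's src attribute into an absolute
# URL, using the response's base rather than the URL we asked for.
sub frame_url {
    my ($response, $src) = @_;
    return URI->new_abs($src, $response->base)->as_string;
}

# Simulated final response (assumption: no real request is made).
# After a redirect, the request attached to the response is the one
# for /dvt/, and base() falls back to that request's URI when the
# response carries no Content-Base or Base header.
my $response = HTTP::Response->new(200, 'OK');
$response->request(HTTP::Request->new(GET => 'http://users.pandora.be/dvt/'));

print "base:  ", $response->base, "\n";
print "frame: ", frame_url($response, 'top.htm'), "\n";
```

    In the real script the response comes from $agent->request($request), and the same $result->base call picks up the redirected URL.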

    blokhead

      First, I apologize for the bug. When I submitted the code, I edited it in the submit box to take out my corrected code and show the bug. That is where $cururi was set.

      Anyway..... thank you, thank you, thank you. I was not aware of the $result->base->as_string method. I knew that it had to be there :)


      I admit it, I am Paco.