This code snippet uses the URI module to convert relative URLs found in the content of HTTP::Response objects (as returned by LWP::UserAgent or WWW::Mechanize) to absolute URLs. It is useful for CGI scripts that act as a proxy between the browser and a website.
use LWP::UserAgent;
use URI;

my $response = LWP::UserAgent->new->get('http://search.cpan.org/');
my $html = $response->content;
my $base = $response->base;

# RegEx converts all links in $html to absolute URLs
$html =~ s/<(.*?)(href|src|action|background)\s*=\s*("|'?)(.*?)\3((\s+.*?)*)>/"<".$1.$2."=\"".URI->new_abs($4,$base)->as_string."\"".$5.">"/eigs;

print $html;

Re: Convert Relative to Absolute URLs on-the-fly
by merlyn (Sage) on Feb 06, 2008 at 16:44 UTC

      Because, as your link says, "This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted." Can't use it, sorry. :-(

      HTML parsing would be the ideal way to do it if one is interested in extracting the links (and/or information from other tags) from the fetched page.
      I just wanted to make sure that when I save a fetched page, all of its links are saved as absolute URLs, so that the next time I open the page I can navigate easily. I wasn't interested in extracting any information from the page. So I came up with the RegEx above, which does the job in just one line.
        "does the job"? You mean "does the job most of the time, as long as there is no unusual HTML there".

        I'm just trying to point out that parsing HTML with a simple regex will fail from time to time, and should be advised against when there are other easy-to-use technologies that will get it right in just a few more lines of code. Hence, my followup to your post.
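For anyone wondering what the "few more lines" might look like, here is a rough sketch using HTML::Parser (with HTML::Entities and URI). It is not the code from the column mentioned above, just one way to do it. The names `absolutize_html` and `%url_attr` are made up for this example, and the attribute list simply mirrors the one in the regex. Tags without a URL attribute are passed through untouched; tags with one are rebuilt with double-quoted attribute values.

```perl
use strict;
use warnings;
use HTML::Parser ();
use HTML::Entities qw(encode_entities);
use URI;

# Attributes treated as URLs (same list as the regex in the original snippet).
my %url_attr = map { $_ => 1 } qw(href src action background);

# Hypothetical helper: returns a copy of $html with URL attributes made absolute.
sub absolutize_html {
    my ($html, $base) = @_;
    my $out = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ($tag, $attr, $attrseq, $text) = @_;

                # No URL attribute here: keep the tag exactly as written.
                unless (grep { $url_attr{$_} } @$attrseq) {
                    $out .= $text;
                    return;
                }

                # Rebuild the tag, resolving URL attributes against $base.
                my $rebuilt = "<$tag";
                for my $name (@$attrseq) {
                    next if $name eq '/';    # trailing slash of <img ... />
                    my $value = $attr->{$name};
                    $value = URI->new_abs($value, $base)->as_string
                        if $url_attr{$name};
                    # The parser hands us decoded values, so re-encode them.
                    $rebuilt .= qq{ $name="}
                              . encode_entities($value, q{<>&"})
                              . q{"};
                }
                $rebuilt .= exists $attr->{'/'} ? ' />' : '>';
                $out .= $rebuilt;
            },
            'tagname, attr, attrseq, text',
        ],
        # Everything else (text, end tags, comments, ...) is copied verbatim.
        default_h => [ sub { $out .= $_[0] if defined $_[0] }, 'text' ],
    );

    $parser->parse($html);
    $parser->eof;
    return $out;
}

my $sample = q{<a href="/foo">x</a><img src='bar.png' alt="a &amp; b">};
print absolutize_html($sample, 'http://example.com/dir/page.html'), "\n";
```

With a real response you would pass `$response->base` as the second argument, as in the original snippet.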

Re: Convert Relative to Absolute URLs on-the-fly
by sids (Acolyte) on Feb 08, 2008 at 04:03 UTC
    Instead of modifying the relative URLs everywhere, wouldn't it be better to just add a <base href="http://base.url/here/"/> tag in the HTML <head> section?

    You could do it either with a regex (just add it before the </head>) or by parsing the HTML and adding it.
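A minimal sketch of the regex variant, in plain Perl with no extra modules. The function name `add_base_tag` is made up for illustration. One deliberate difference from the suggestion above: it inserts the tag right after the opening <head> rather than before </head>, on the assumption that <base> should come before any <link> or <script> elements it is meant to affect. If the page already has a <base href>, it is left alone.

```perl
use strict;
use warnings;

# Hypothetical helper: returns $html with a <base href="..."> tag added.
sub add_base_tag {
    my ($html, $base) = @_;

    # Leave the page alone if it already declares a base URL.
    return $html if $html =~ /<base\b[^>]*\bhref\s*=/i;

    # Escape the two characters that could break the attribute value.
    (my $escaped = "$base") =~ s/&/&amp;/g;
    $escaped =~ s/"/&quot;/g;
    my $tag = qq{<base href="$escaped">};

    # Insert right after the opening <head ...> tag if there is one ...
    if ($html =~ s/(<head\b[^>]*>)/$1$tag/i) {
        return $html;
    }

    # ... otherwise fall back to putting it at the very top.
    return $tag . $html;
}

my $page = '<html><head><title>t</title></head><body><a href="x.html">x</a></body></html>';
print add_base_tag($page, 'http://base.url/here/'), "\n";
```

With the original snippet, the call would be `add_base_tag($html, $response->base)`. Like any regex over HTML, this can be fooled, for example by a literal "<head>" inside a comment.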


    If you want to improve, be content to be thought foolish and stupid. -- Epictetus

      Thanks!! This worked a treat!

      To get here I'd tried Google search terms like 'fill in URL', 'fill in hostname site or sitename' and 'html as the browser shows it'. With a little luck, saying this may help some other benighted soul find this page :-)