in reply to Pulling a Page with LWP::UserAgent and fixing URLs?

Firstly, don't try to parse HTML yourself. Use one of the many CPAN modules available. I prefer HTML::TokeParser::Simple.

I'm not exactly sure what you're trying to do with the image tags. Are you trying to fix broken links, or make them absolute instead of relative?

Re^2: Pulling a Page with LWP::UserAgent and fixing URLs?
by MrForsythExeter (Novice) on Nov 09, 2004 at 12:03 UTC
    Yeah, sorry. I'm trying to make them absolute, so for example src="http://www.xxxxx.co.uk/images/uploads/blah.gif", src="../../blah.gif" and src="/images/uploads/blah.gif" all become src="http://www.xxxx.com/images/uploads/blah.gif". Hope that helps you understand me.
      Here's some code to do some of that:
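      use HTML::TokeParser::Simple;

      # $html holds the fetched page; $new_url is the prefix to splice in
      my $parser = HTML::TokeParser::Simple->new(string => $html);
      my $new_html;
      while ( my $token = $parser->get_token ) {
          for my $attr ( 'src', 'href' ) {
              my $value = $token->get_attr($attr);
              next unless $value;

              # only touch links to images and Flash files
              next unless $value =~ /\.(gif|jpe?g|png|swf)$/;

              # replace the final slash with $new_url, keeping the file name
              $value =~ s/\/([\.[:word:]\-]+?)$/$new_url$1/;
              $token->set_attr($attr, $value);
          }
          $new_html .= $token->as_is;
      }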
      Then your result is in $new_html. Of course, this won't handle everything, since you could have references to images, etc. in JavaScript, for example.
      ok then

      use URI::URL;

      Teabag

      -- Siggy Played Guitar
      Sure there's more than one way, but one just needs one anyway - Teabag
        URI::URL is only there for old stuff (backward compatibility and all that). Looks like URI is the one. However, are you saying I should parse out all the URLs, use this to fix them, and then put them back in? Or could I do a regexp with /e on the end and do it all in one line?
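For what it's worth, here is a minimal sketch of the "fix each URL with URI" idea. The page URL, the sample values and the absolutize helper are made up for illustration; only URI->new_abs and as_string come from the URI module. It resolves one attribute value at a time, so it could be called on each src/href value inside a token loop rather than run as a regexp over the whole document.

```perl
use strict;
use warnings;
use URI;

# Assumed URL of the page that was fetched; relative links resolve against it.
my $page_url = 'http://www.example.com/images/uploads/index.html';

# Hypothetical helper: resolve one attribute value against a base URL.
sub absolutize {
    my ($value, $base) = @_;
    return URI->new_abs($value, $base)->as_string;
}

# Made-up sample values: relative, parent-relative, root-relative, absolute.
my @samples = (
    'blah.gif',
    '../uploads/blah.gif',
    '/images/uploads/blah.gif',
    'http://www.example.com/images/uploads/blah.gif',
);

print absolutize($_, $page_url), "\n" for @samples;
```

With that assumed page URL, all four samples come out as http://www.example.com/images/uploads/blah.gif. A value that is already absolute passes through unchanged, so there is no need to test for that case first.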

      As suggested before, take a look at URI, and if you do, please please please don't overlook the nifty (yet annoying ;) <base... /> tag.
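A rough sketch of why the base tag matters: if the page declares one, relative links resolve against its href rather than against the URL the page was fetched from. The effective_base helper, the sample HTML and the URLs below are invented for illustration; the parser and URI calls are the same ones used elsewhere in this thread.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;
use URI;

# Hypothetical helper: return the URL that relative links in $html resolve against.
sub effective_base {
    my ($html, $page_url) = @_;
    my $parser = HTML::TokeParser::Simple->new(string => $html);
    while ( my $token = $parser->get_token ) {
        next unless $token->is_start_tag('base');
        my $href = $token->get_attr('href');
        next unless defined $href && length $href;
        # The href may itself be relative, so resolve it against the page URL.
        return URI->new_abs($href, $page_url)->as_string;
    }
    # No usable base tag: fall back to the URL the page was fetched from.
    return $page_url;
}

# Made-up page with a base tag pointing somewhere other than the page URL.
my $html = '<html><head><base href="http://www.example.com/images/"></head>'
         . '<body><img src="uploads/blah.gif"></body></html>';

my $base = effective_base($html, 'http://www.example.com/index.html');
print URI->new_abs('uploads/blah.gif', $base), "\n";
```

Here the image resolves to http://www.example.com/images/uploads/blah.gif, not http://www.example.com/uploads/blah.gif. If the page came from LWP::UserAgent, the base method of the HTTP::Response object is, as far as I know, meant to give the same answer without the extra parsing pass, so it is worth checking before rolling your own.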

      --
      b10m

      All code is usually tested, but rarely trusted.