Xxaxx has asked for the wisdom of the Perl Monks concerning the following question:

When grabbing a webpage from a foreign server I need to convert any relative URLs to absolute ones.

The code snippet below works but I was wondering if there is a better way.

.....
my ($urltop)   = 'http://www.somewhere.com/';
my ($urltodir) = 'http://www.somewhere.com/folder/';
my (@matches)  = ($content =~ /href="([^"]*)"/gi);
foreach my $match (@matches) {
    if ($match =~ /^http/i) {
        ## absolute leave alone
    }
    elsif ($match =~ /^\//) {
        $content =~ s/href="$match"/href="$urltop$match"/gi;
    }
    else {
        $content =~ s/href="$match"/href="$urltodir\/$match"/gi;
    }
}
I could do the leading slash stuff with:
$content =~ s/href="\/([^"]*)"/href="$urltop$1"/gi;
But I can't figure out how to do the other matches without the foreach.

Thanks for any clues
Claude
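
One way to drop the foreach is to let a single substitution call a helper for every href it finds. The following is only a sketch under stated assumptions: `absolutize` is a hypothetical helper name, the sample `$content` is made up, and the helper strips the leading slash before prepending `$urltop` so that the result does not contain a doubled slash. Schemes such as mailto: are not handled.

```perl
use strict;
use warnings;

my $urltop   = 'http://www.somewhere.com/';          # from the post above
my $urltodir = 'http://www.somewhere.com/folder/';   # from the post above

# Hypothetical helper: map one href value to its absolute form.
sub absolutize {
    my ($href) = @_;
    return $href if $href =~ /^http/i;        # already absolute, leave alone
    if ($href =~ m{^/}) {                     # site-relative: drop the leading slash
        (my $rest = $href) =~ s{^/}{};
        return $urltop . $rest;
    }
    return $urltodir . $href;                 # directory-relative
}

# Made-up sample with one link of each kind.
my $content = '<a href="http://x.org/">a</a> <a href="/top.html">b</a> <a href="page.html">c</a>';

# /e evaluates the replacement as Perl code, so each href is rewritten exactly once.
$content =~ s/href="([^"]*)"/'href="' . absolutize($1) . '"'/gie;

print "$content\n";
```

Because every href is visited once, a value that appears several times in the page is not substituted repeatedly, and no special characters in the value need quoting.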

Replies are listed 'Best First'.
Re: Regex question: Is there a better way?
by Beatnik (Parson) on Apr 22, 2001 at 17:53 UTC
    From the HTML::LinkExtor POD...
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI::URL;

    $url = "http://www.perl.org/";  # for instance
    $ua = LWP::UserAgent->new;

    # Set up a callback that collects image links
    my @imgs = ();
    sub callback {
        my($tag, %attr) = @_;
        return if $tag ne 'img';  # we only look closer at <img ...>
        push(@imgs, values %attr);
    }

    # Make the parser.  Unfortunately, we don't know the base yet
    # (it might be different from $url)
    $p = HTML::LinkExtor->new(\&callback);

    # Request document and parse it as it arrives
    $res = $ua->request(HTTP::Request->new(GET => $url),
                        sub {$p->parse($_[0])});

    # Expand all image URLs to absolute ones
    my $base = $res->base;
    @imgs = map { $_ = url($_, $base)->abs; } @imgs;

    # Print them out
    print join("\n", @imgs), "\n";

    Greetz
    Beatnik
    ... Quidquid perl dictum sit, altum viditur.
      Thanks for the HTML::LinkExtor solution.
      This will accomplish the task of expanding the image urls.

      But, alas, I'm trying to expand my regex ability.
      There may be no regex that matches only the values that start with neither "http nor "/, but that is what I'm looking for.

      my $content =<<"(END)";
      jfds k blah="http:/stufff" fjksldf jsdf jsdlfjs jflds fjsf
      jfdj blah="/some other stuff" fjsd fjslf s fjs fjs fjsfj fjsd
      jjfd jfdjlkf blah="stuff I'm lookig for" fjdls fsf sjfks
      (END)
      Is there a regex that will focus on blah="stuff I'm lookig for" and skip over blah="http:/stufff" and blah="/some other stuff"?

      Thanks
      Claude
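
A negative lookahead is one way to say "neither http nor /" inside a single pattern. This is a minimal sketch rather than a definitive answer: it reuses the sample text from the post (the line breaks are a guess) and assumes `$urltodir` ends with a slash so it can be prepended directly.

```perl
use strict;
use warnings;

# Assumed base directory; the trailing slash matters here.
my $urltodir = 'http://www.somewhere.com/folder/';

# Sample text from the post above.
my $content = <<'END';
jfds k blah="http:/stufff" fjksldf
jfdj blah="/some other stuff" fjsd
jjfd blah="stuff I'm lookig for" fjdls
END

# (?!http|/) is a negative lookahead: it succeeds only when the value does
# NOT start with "http" or "/", and it consumes no characters itself.
my $count = ($content =~ s{blah="(?!http|/)([^"]*)"}{blah="$urltodir$1"}gi);

print $content;
```

Only the third value is rewritten; the first two fail the lookahead and are skipped. The same idea should carry over to href="..." by changing the attribute name in the pattern.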

        Your problem here is that you need to keep quotes like these from being picked up:

        HREF="http://whatever" target="blank"

        First you need to strip the HTML.

        Then run a regexp on what's left that suits your needs. If you're sure all quotes are 'well formed' (i.e. each quote is closed), you can use something as simple as:

        /"([^"]*?)"/g

        Is this more helpful?

        cLive ;-)
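
To see what the bare quote pattern picks up, here is a small sketch on a made-up fragment (the `$html` string is an assumption, not taken from any real page). It compares the pattern above with a variant that requires the attribute name in front of the quote, which is one possible way to keep target="blank" out of the results.

```perl
use strict;
use warnings;

# Hypothetical fragment with two quoted attributes on one tag.
my $html = '<a HREF="http://whatever" target="blank">x</a> <a href="page.html">y</a>';

# The bare pattern captures every quoted string, whatever attribute it belongs to.
my @all = $html =~ /"([^"]*?)"/g;

# Requiring the attribute name in front limits the captures to href values.
my @hrefs = $html =~ /\bhref="([^"]*)"/gi;

print "all:   @all\n";
print "hrefs: @hrefs\n";
```

Both patterns still assume double-quoted, well-formed attributes; single-quoted or unquoted values would be missed.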

Re: Regex question: here's the lazy way
by cLive ;-) (Prior) on Apr 23, 2001 at 00:16 UTC
    add in a base href tag:
    # create html
    my $base = qq(<BASE HREF="$urltodir">);

    # add base tag to page - assumes all pages well formed
    $content =~ s/(<HEAD>)/$1$base/is;

    That way you don't have to run a bunch of regexps on the page - good practice, but not needed :)

    cLive ;-)
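
The base tag approach can be tried on a small page like this. It is a sketch only: the sample page is made up, and it assumes a plain `<head>` tag with no attributes is present. The check on the substitution's return value is an addition, since a page without that tag would otherwise be left unchanged silently.

```perl
use strict;
use warnings;

my $urltodir = 'http://www.somewhere.com/folder/';   # from the original post

# Hypothetical page; assumes a literal <head> tag with no attributes.
my $content = <<'END';
<html><head><title>t</title></head>
<body><a href="page.html">link</a></body></html>
END

my $base = qq(<BASE HREF="$urltodir">);

# Insert the base tag right after the opening <head>; /i accepts any case.
my $changed = ($content =~ s/(<HEAD>)/$1$base/i);

warn "no <head> tag found, page left unchanged\n" unless $changed;
print $content;
```

The relative href stays as written in the page; the browser resolves it against the base URL, so no per-link rewriting is needed.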