Regex question: Is there a better way?

Xxaxx has asked for the wisdom of the Perl Monks concerning the following question:

In grabbing a webpage from a foreign server I need to change any relative URLs to absolute ones.

The code snippet below works but I was wondering if there is a better way.

.....
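my $urltop   = 'http://www.somewhere.com/';
my $urltodir = 'http://www.somewhere.com/folder/';

# collect every href value, then rewrite the relative ones
my @matches = ($content =~ /href="([^"]*)"/gi);
foreach my $match (@matches) {
    if ($match =~ /^http/i) {
        ## absolute: leave alone
    } elsif ($match =~ m{^/}) {
        # root-relative: $urltop already ends in a slash, so drop the leading one
        (my $path = $match) =~ s{^/}{};
        $content =~ s/href="\Q$match\E"/href="$urltop$path"/gi;
    } else {
        # directory-relative: $urltodir already ends in a slash
        $content =~ s/href="\Q$match\E"/href="$urltodir$match"/gi;
    }
}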

I could do the leading slash stuff with:

# the leading slash is consumed by the match; $urltop already ends in one
$content =~ s{href="/([^"]*)"}{href="$urltop$1"}gi;

But I can't figure out how to do the other matches without the foreach.

Thanks for any clues
Claude


Replies are listed 'Best First'.
Re: Regex question: Is there a better way? by Beatnik (Parson) on Apr 22, 2001 at 17:53 UTC
From the HTML::LinkExtor POD...

use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;

$url = "http://www.perl.org/";  # for instance
$ua = LWP::UserAgent->new;

# Set up a callback that collects image links
my @imgs = ();
sub callback {
    my($tag, %attr) = @_;
    return if $tag ne 'img';  # we only look closer at <img ...>
    push(@imgs, values %attr);
}

# Make the parser. Unfortunately, we don't know the base yet
# (it might be different from $url)
$p = HTML::LinkExtor->new(\&callback);

# Request document and parse it as it arrives
$res = $ua->request(HTTP::Request->new(GET => $url),
                    sub {$p->parse($_[0])});

# Expand all image URLs to absolute ones
my $base = $res->base;
@imgs = map { $_ = url($_, $base)->abs; } @imgs;

# Print them out
print join("\n", @imgs), "\n";

Greetz
Beatnik
... Quidquid perl dictum sit, altum viditur.
Re: Re: Regex question: Is there a better way? by Xxaxx (Monk) on Apr 22, 2001 at 23:25 UTC
Thanks for the HTML::LinkExtor solution. This will accomplish the task of expanding the image URLs. But, alas, I'm trying to expand my regex ability. It might be that there isn't a regex that will match only the values that start with neither "http nor "/. But that's what I'm looking for.

my $content =<<"(END)";
jfds k blah="http:/stufff" fjksldf
jsdf jsdlfjs jflds fjsf jfdj
blah="/some other stuff" fjsd fjslf s
fjs fjs fjsfj fjsd jjfd jfdjlkf
blah="stuff I'm looking for" fjdls fsf sjfks
(END)

Is there a regex that will focus on blah="stuff I'm looking for" and skip over blah="http:/stufff" and blah="/some other stuff"?

Thanks
Claude
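One way to skip the "http" and "/" cases in a single pass is a negative lookahead placed right after the opening quote. What follows is only a minimal sketch, not anyone's posted solution: the sample text and the names @relative and $fixed are made up for illustration, and $urltodir is assumed to keep its trailing slash as in the original post. Like the /^http/i test in the original loop, it would also treat a relative name such as "httpd.html" as absolute.

```perl
use strict;
use warnings;

# Assumed base directory, trailing slash included, as in the original post.
my $urltodir = 'http://www.somewhere.com/folder/';

# Illustrative sample text (hypothetical, modelled on the question).
my $content = <<'END';
jfds k blah="http:/stufff" fjksldf
jfdj blah="/some other stuff" fjsd
jfdjlkf blah="relative/page.html" fjdls
END

# (?!http|/) is a zero-width negative lookahead: the value that follows the
# opening quote must not begin with "http" or "/".
my @relative = $content =~ m{blah="(?!http|/)([^"]*)"}gi;
print "$_\n" for @relative;

# The same lookahead inside s///g rewrites only the relative values,
# so no foreach over the matches is needed.
(my $fixed = $content) =~ s{blah="(?!http|/)([^"]*)"}{blah="$urltodir$1"}gi;
print $fixed;
```

The lookahead consumes no characters, so the capture group still receives the whole value. Swapping blah for href gives the substitution for the directory-relative case of the original loop.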
Re: Regex question: Is there a better way? by cLive ;-) (Prior) on Apr 23, 2001 at 00:31 UTC
Your problem here is that you need to keep quoted values like HREF="http://whatever" target="blank" from being picked up. First you need to strip the HTML. Then run a regexp on what's left that suits your needs. If you're sure all quotes are 'well formed' (i.e. each quote is closed), you can use something as simple as:

/"([^"]*?)"/g

Is this more helpful?

cLive ;-)
Re: Regex question: here's the lazy way by cLive ;-) (Prior) on Apr 23, 2001 at 00:16 UTC
Add in a base href tag:

# create html
my $base = qq(<BASE HREF="$urltodir">);

# add base tag to page - assumes all pages well formed
$content =~ s/(<HEAD>)/$1$base/is;

That way you don't have to run a bunch of regexps on the page - good practice, but not needed :)

cLive ;-)
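A slightly more defensive variant of the base-tag idea, offered only as a sketch: the helper name add_base_href is hypothetical and not from this thread. It assumes the page may carry attributes on its head tag or may already declare a base, and it leaves the page untouched when no head tag is found.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): insert a <base> tag right after
# the opening <head> tag, unless the page already declares one.
sub add_base_href {
    my ($html, $base_url) = @_;

    # Leave the page alone if it already has a base href.
    return $html if $html =~ /<base\b[^>]*\bhref\s*=/i;

    # \b keeps <header> from matching; [^>]* allows attributes on <head>.
    # If there is no <head> tag at all, the substitution does nothing.
    $html =~ s/(<head\b[^>]*>)/$1<base href="$base_url">/i;
    return $html;
}

my $page = '<html><head profile="x"><title>t</title></head>'
         . '<body><a href="a.html">a</a></body></html>';
print add_base_href($page, 'http://www.somewhere.com/folder/'), "\n";
```

With a base tag in place the browser resolves the relative links itself, so the href values in the page body stay exactly as the remote server sent them.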