Joey The Saint has asked for the wisdom of the Perl Monks concerning the following question:

I've taken this code from two parts of the Perl Cookbook and I'm having problems making them work together. My goal is to take URLs in a block of text and rewrite them according to the following rules:

  1. absolute links (containing a protocol and dns) are to be left alone
  2. links with an absolute path will have a protocol and DNS pre-pended
  3. relative links will have a protocol, DNS and path pre-pended

That's probably not entirely clear. How about a code snippet:

    $server = "http://www.foo.com";
    $path   = "/absolute/path/";
    $html   = '
    <a href="/absolute/no/dns">absolute with no dns</a>
    <a href="http://absolute.with/dns.html">http://absolute.with/dns.html</a>
    <a href="relative/without/dns.html">relative/without/dns.html</a>
    <a href="relative2/without/dns.html">relative2 without dns.html</a>
    ';

    $html =~ s/
        (<\s*
         (?:a|img|area)
         [^>]+?(?:href|src)
         \s*=\s*
         ["']?
        )
        (
         [^'"\/>]
         [^'" >]+?
        )
        ([ '"]?>)
    /
        $1.sprintf("%s%s", $path, $2).$3
    /sigex;

This bit works okay for the first one and the last two, but the middle case (http://) fails because (clearly) I don't have anything that tells it to avoid a leading protocol string (something like http://, ftp://, gopher://, news://, etc.). So I looked at the urlify program in chapter 6 of the Cookbook and tried this:

    $html =~ s/
        (<\s*
         (?:a|img|area)
         [^>]+?(?:href|src)
         \s*=\s*
         ["']?
        )
        (
         [^'"\/>(http|telnet|gopher|file|wais|ftp)]
         [^'" >]+?
        )
        ([ '"]?>)
    /
        $1.sprintf("%s%s", $path, $2).$3
    /sigex;

Which is _far_ worse since now none of the cases matches. (Well, not entirely true: if I remove the non-match for a leading '/' I can get the first case to match, but that's exactly what I don't want.)

I guess this is my question: How can I do a non-match on a string? I want to prevent the http:// links from matching, but I can't seem to get it to play nice. Has anyone else done this?

Oh, and don't worry about the full DNS pre-pending; it's the same problem, so when I fix one, the other comes for free. But if someone has a suggestion on how I could do this all in one pass, I'd love to hear it. As it is, I'm planning on doing two passes: the first with the path, the second with the DNS and protocol info.

-J.

Replies are listed 'Best First'.
(arturo) Re: matching the non-presence of a string
by arturo (Vicar) on Mar 20, 2001 at 21:09 UTC

    I'd try something like HTML::LinkExtor to extract the links and then work on them using ordinary conditionals.

    As far as matching strings that don't contain certain patterns, you have a number of options. Within a regex, (?!pattern) is a negative lookahead: it succeeds at a given position only if pattern does not match starting there, and it consumes no characters. Or you could put the negation outside ($string !~ /pattern/).
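    A minimal sketch of the difference between a negated character class and a negative lookahead, using the scheme list from the question. The helper name is_relative and the sample strings 'page.html' and 'httpd.conf' are made up for illustration and are not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: true when $href does NOT begin with one of the
# schemes listed in the question, followed by "://".
sub is_relative {
    my ($href) = @_;
    # (?!...) is tested only at the current position (here, right after ^)
    # and consumes no characters.
    return $href =~ m{^(?!(?:http|telnet|gopher|file|wais|ftp)://)} ? 1 : 0;
}

# A bracketed list is a character class, not a list of words: this one
# rejects any first character out of ( h t p | f ).
my $class_ok = 'page.html' =~ m{^[^(http|ftp)]} ? 1 : 0;   # 0: 'p' is in the class

for my $href ('http://absolute.with/dns.html',
              'relative/without/dns.html',
              'httpd.conf',
              'page.html') {
    printf "%-32s %s\n", $href,
        is_relative($href) ? 'no scheme' : 'has scheme';
}
print "character class accepts 'page.html'? ", ($class_ok ? 'yes' : 'no'), "\n";
```

    The same effect explains why the second attempt in the question matched nothing: 'r' appears in "gopher", so the class also rejects the first letter of "relative".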

    For real robustness without tearing your hair out, though, I really do recommend moving your logic outside of the regular expression and just doing an explicit series of matches against the href contents.

    HTH

    Philosophy can be made out of anything. Or less -- Jerry A. Fodor

      Ah, now I feel silly. You're right, of course. (?!) is what I wanted. My regex has gotten a bit uglier, but it works if I do it this way:

          $nomatch="(?!http|telnet|gopher|...|\"|'|\/| )";
          . . .
          (
           $nomatch
           [^'" >]+?
          )
          . . .

      I don't quite understand why the $nomatch substitution works, but it's probably something simple too and this is a better solution anyway. I just have one place to update if things need to be changed.

      As an aside, since my goal is to re-write the urls in place, I don't see how HTML::LinkExtor would help. I can get the links just fine, I'm just having problems doing the re-writing inline. Am I missing something there?

      -J.

        As an aside, since my goal is to re-write the urls in place, I don't see how HTML::LinkExtor would help. I can get the links just fine, I'm just having problems doing the re-writing inline. Am I missing something there?

        Well, not really, I guess. You could try going through line-by-line with a regex that extracts the links and then mangles them appropriately in-line. The LinkExtor approach would be to use it to grab all the links, then for each of those links, do an s///g on the text you have (lumped together as a single string) to do the URL mangling. This way *might be* slower (or impractical for other reasons), but given a choice between speed and correctness, my impulse usually lies with correctness. YMMV, of course!
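        The "for each link, do an s///g on the text" step might look roughly like the sketch below. To keep it self-contained, the list of links is hard-coded as a stand-in for whatever HTML::LinkExtor would return, and the helper absolutize is a hypothetical name, so treat this as an illustration under those assumptions rather than the approach as arturo would write it.

```perl
use strict;
use warnings;

my $server = "http://www.foo.com";
my $path   = "/absolute/path/";

# Assumption: stand-in for the links an extractor such as HTML::LinkExtor
# would hand back; hard-coded here so the sketch runs on its own.
my @links = (
    '/absolute/no/dns',
    'http://absolute.with/dns.html',
    'relative/without/dns.html',
);

my $html = join "\n",
    '<a href="/absolute/no/dns">absolute with no dns</a>',
    '<a href="http://absolute.with/dns.html">http://absolute.with/dns.html</a>',
    '<a href="relative/without/dns.html">relative/without/dns.html</a>';

# Hypothetical helper: the three rules from the question as plain conditionals.
sub absolutize {
    my ($href) = @_;
    return $href          if $href =~ m{^(?:http|telnet|gopher|file|wais|ftp)://};
    return "$server$href" if $href =~ m{^/};
    return "$server$path$href";
}

for my $old (@links) {
    my $new = absolutize($old);
    next if $new eq $old;    # already absolute, nothing to rewrite
    # \Q...\E quotes regex metacharacters in the link; requiring the
    # surrounding quotes leaves the visible link text untouched.
    $html =~ s/(href\s*=\s*(["']))\Q$old\E\2/$1$new$2/gi;
}

print "$html\n";
```

        Anchoring on the quoted href value matters because the same string can also appear as the link text, which should not be rewritten.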

        Philosophy can be made out of anything. Or less -- Jerry A. Fodor

Re: matching the non-presence of a string
by mirod (Canon) on Mar 20, 2001 at 21:37 UTC

    arturo is right, the problem is too complex to be handled just using regexps (what about the base element?).

    Nonetheless I'll give it a go, as it demonstrates one of my favourite techniques... I just grab the href attribute in the first regexp and then use a subroutine to analyze it and modify it appropriately. I think this should work for simple links:

    #!/bin/perl -w
    use strict;

    my $server = "http://www.foo.com";
    my $path   = "/absolute/path/";

    while( <DATA>) {
        s{<\s*a\s*href=(['"])(.*?)\1\s*>([^<]*)</a>}
         { build_link( $2, $3) }seg;    # hand out the updating to a subroutine
        print;
    }

    sub build_link    # here we can play with the href attribute
    {
        my( $href, $text)= @_;
        if( $href=~ m{^(http|telnet|gopher|file|wais|ftp)://}) {
            return qq{<a href="$href">$text</a>};
        }
        elsif( $href=~ m{^/}) {
            return qq{<a href="$server$href">$text</a>};
        }
        else {
            return qq{<a href="$server$path$href">$text</a>};
        }
    }

    __DATA__
    <a href="/absolute/no/dns">absolute with no dns</a>
    <a href="http://absolute.with/dns.html">http://absolute.with/dns.html</a>
    <a href="relative/without/dns.html">relative/without/dns.html</a>
    <a href="relative2/without/dns.html">relative2 without dns.html</a>
    <a href="relative/without/dns.html">relative/without/dns.html</a>
    <a name="toto">toto</a>

Re: matching the non-presence of a string
by dvergin (Monsignor) on Mar 20, 2001 at 23:43 UTC
    It's not central to your question, but a friendly reminder in support of good regex habits. You specify /s for each regex but there are no periods in either one. Specifying '/s' simply allows '.' to match line-end characters within multi-line strings.
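    A small sketch of that point, using a made-up two-line string: /s changes what '.' matches and nothing else, so a pattern built only from character classes behaves the same with or without it.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line";

# With a '.', /s decides whether the match can cross the newline.
my $dot_without_s = $text =~ /first.*second/  ? 1 : 0;   # 0: '.' stops at "\n"
my $dot_with_s    = $text =~ /first.*second/s ? 1 : 0;   # 1: '.' matches "\n" too

# A negated character class already matches "\n", so /s changes nothing.
my $class_without_s = $text =~ /first[^>]+second/  ? 1 : 0;   # 1
my $class_with_s    = $text =~ /first[^>]+second/s ? 1 : 0;   # 1

print "dot:   without /s = $dot_without_s, with /s = $dot_with_s\n";
print "class: without /s = $class_without_s, with /s = $class_with_s\n";
```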