I've taken this code from two parts of the Perl Cookbook and I'm having problems making them work together. My goal is to take URLs in a block of text and rewrite them according to the following rules:

  1. absolute links (containing a protocol and DNS name) are to be left alone
  2. links with an absolute path will have a protocol and DNS name prepended
  3. relative links will have a protocol, DNS name and path prepended

That's probably not entirely clear. How about a code snippet:

    $server = "http://www.foo.com";
    $path   = "/absolute/path/";
    $html   = '
    <a href="/absolute/no/dns">absolute with no dns</a>
    <a href="http://absolute.with/dns.html">http://absolute.with/dns.html</a>
    <a href="relative/without/dns.html">relative/without/dns.html</a>
    <a href="relative2/without/dns.html">relative2 without dns.html</a>
    ';

    $html =~ s/
        (<\s*
         (?:a|img|area)
         [^>]+?(?:href|src)
         \s*=\s*
         ["']?
        )
        ( [^'"\/>] [^'" >]+? )
        ([ '"]?>)
    /
        $1.sprintf("%s%s", $path, $2).$3
    /sigex;

This works for the first link and the last two, but the second case (http://) fails because, clearly, nothing tells the pattern to skip a leading protocol string (http://, ftp://, gopher://, news://, etc.). So I looked at the urlify program in chapter 6 of the Cookbook and tried this:

    $html =~ s/
        (<\s*
         (?:a|img|area)
         [^>]+?(?:href|src)
         \s*=\s*
         ["']?
        )
        ( [^'"\/>(http|telnet|gopher|file|wais|ftp)] [^'" >]+? )
        ([ '"]?>)
    /
        $1.sprintf("%s%s", $path, $2).$3
    /sigex;

Which is _far_ worse, since now none of the cases match. (Well, not entirely true: if I remove the non-match for a leading '/', I can get the first case to match, but that's exactly what I don't want.)
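
A likely explanation, offered as a sketch rather than anything from the Cookbook: inside [...] the parentheses and the | are not grouping or alternation, they are just more single characters in the set. The negated class above therefore rejects any link whose first character is one of the letters h, t, p, e, l, n, g, o, r, f, i, w, a or s, which includes the "r" in "relative". The last two test strings below are made up for illustration and are not from the sample HTML.

```perl
use strict;
use warnings;

# The negated class from the second attempt, anchored so it only
# tests the first character of each link.
my $class = qr/^[^'"\/>(http|telnet|gopher|file|wais|ftp)]/;

my @hrefs = (
    'relative/without/dns.html',      # starts with "r" (from "ftp|gopher")
    'http://absolute.with/dns.html',  # starts with "h"
    'images/logo.png',                # hypothetical: starts with "i" (from "file")
    'docs/index.html',                # hypothetical: "d" is not in the set
);

# Record whether the first character survives the negated class.
my %first_char_ok = map { $_ => ($_ =~ $class ? 1 : 0) } @hrefs;

printf "%-32s %s\n", $_, $first_char_ok{$_} ? "accepted" : "rejected"
    for @hrefs;
```

Only the link starting with "d" gets through, which is consistent with none of the sample links matching.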

I guess this is my question: How can I do a non-match on a string? I want to prevent the http:// links from matching, but I can't seem to get it to play nice. Has anyone else done this?
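
One way to express "not followed by this string" in Perl is a zero-width negative lookahead, (?!...). The sketch below drops one into the first substitution. The $scheme list is an assumption built from the protocols named above, and sprintf has been swapped for plain concatenation; it is a starting point under those assumptions, not a tested solution for arbitrary HTML.

```perl
use strict;
use warnings;

my $path = "/absolute/path/";
my $html = <<'HTML';
<a href="/absolute/no/dns">absolute with no dns</a>
<a href="http://absolute.with/dns.html">http://absolute.with/dns.html</a>
<a href="relative/without/dns.html">relative/without/dns.html</a>
<a href="relative2/without/dns.html">relative2 without dns.html</a>
HTML

# Assumed scheme list; extend it to whatever protocols matter to you.
my $scheme = qr{(?:http|telnet|gopher|file|wais|ftp|news)://}i;

# (?!$scheme) succeeds only when a scheme does NOT start at this
# position, and it consumes no characters. The character class after
# it still rejects a leading slash, so only relative links are touched.
(my $out = $html) =~ s/
    (<\s* (?:a|img|area) [^>]+?(?:href|src) \s*=\s* ["']? )
    ( (?!$scheme) [^'"\/>] [^'" >]+? )
    ([ '"]?>)
/ $1 . $path . $2 . $3 /sigex;

print $out;
```

With the sample HTML, the two relative links gain the path while the http:// link and the link with a leading slash are left as they were.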

Oh, and don't worry about prepending the full DNS name; it's the same problem, so when I fix one, the other comes for free. But if someone has a suggestion on how I could do this all in one pass, I'd love to hear it. As it is, I'm planning on two passes: the first with the path, the second with the DNS and protocol info.
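
For the one-pass idea, a possible sketch is to capture every href/src value and let a small helper decide which of the three rules applies. The helper name absolutize is hypothetical, and treating anything that starts with "name:" as already absolute is an assumption that goes a little beyond the protocol list above. Regexes on HTML are fragile, so an HTML parser would be sturdier for real pages.

```perl
use strict;
use warnings;

my $server = "http://www.foo.com";
my $path   = "/absolute/path/";
my $html = <<'HTML';
<a href="/absolute/no/dns">absolute with no dns</a>
<a href="http://absolute.with/dns.html">http://absolute.with/dns.html</a>
<a href="relative/without/dns.html">relative/without/dns.html</a>
<a href="relative2/without/dns.html">relative2 without dns.html</a>
HTML

# Hypothetical helper: apply the three rules to a single URL.
sub absolutize {
    my ($url) = @_;
    return $url           if $url =~ m{^[a-z][a-z0-9+.-]*:}i;  # rule 1: already has a scheme
    return $server . $url if $url =~ m{^/};                    # rule 2: absolute path
    return $server . $path . $url;                             # rule 3: relative link
}

# One pass: grab the attribute value and hand it to the helper.
(my $fixed = $html) =~ s{
    ( <\s* (?:a|img|area) \b [^>]*? \b(?:href|src) \s*=\s* ["']? )
    ( [^'"\s>]+ )
}{ $1 . absolutize($2) }gsixe;

print $fixed;
```

The decision logic lives in ordinary Perl rather than in the pattern, so adding a rule means adding a line to the helper instead of another alternation in the regex.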

-J.


In reply to matching the non-presence of a string by Joey The Saint
