Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I've got an odd problem. I'm using the code from "urlify" to convert urls embedded in text to links. I use 2 regexps in order to catch urls that start a line, and ones that do not. This strategy works perfectly in the development environment (win) but the regexp that converts beginning of line urls breaks in production (unix).
$text =~ s{^($urls:[$any]+?)(?=[$punc]*[^$any]|$)}{<a href="$1" target="_blank">$1</a>}igox;
I'm baffled, any ideas?

Replies are listed 'Best First'.
Re: url to html regex problem
by merlyn (Sage) on Oct 05, 2001 at 23:32 UTC
    How about using URI::Find instead? No point re-solving that problem. It even includes an example of what you're trying to do:
    Wrap each URI found in an HTML anchor.

        my $finder = URI::Find->new(
            sub {
                my($uri, $orig_uri) = @_;
                return qq|<a href="$uri">$orig_uri</a>|;
            });
        $finder->find(\$text);

    Although that code is wrong. He needs to HTML-entitize the href parameter. Bleh. Here's the corrected code:

        use CGI qw(escapeHTML);
        use URI::Find;

        my $finder = URI::Find->new(
            sub {
                my($uri, $orig_uri) = @_;
                # entity-escape both values before they go into the HTML
                $_ = escapeHTML("$_") for $uri, $orig_uri;
                return qq|<a href="$uri">$orig_uri</a>|;
            });
        $finder->find(\$text);

    I just sent the author a bug report.

    -- Randal L. Schwartz, Perl hacker

Re: url to html regex problem
by suaveant (Parson) on Oct 05, 2001 at 22:58 UTC
    Why do you have two regexps, one for the beginning of a string and one for anywhere in a string? \b (as shown on that web page) matches a word boundary, which the beginning of a string certainly is...

    Update: Yes, I just tested the code from the web page on my linux box, it worked with urls at the beginning of a line and in the middle of a line...

    Update 2: Yours works fine if I replace ^ with \b... but if you are trying to match urls at the beginning of a line that is not at the beginning of a string you need the m modifier on your regexp so the ^ will match after a newline, as well...
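
    To see what the m modifier changes, here is a minimal sketch. The sample text and the variable names are made up for illustration; only the behaviour of ^ with and without m is the point.

```perl
use strict;
use warnings;

# Two lines in one string; the second line starts with a URL.
my $text = "see below\nhttp://example.com/ is the site\n";

# Without /m, ^ matches only at the very start of the string.
my $without_m = () = $text =~ /^http:/g;

# With /m, ^ also matches right after each embedded newline.
my $with_m = () = $text =~ /^http:/mg;

print "without /m: $without_m, with /m: $with_m\n";
```

    If the whole text is held in one scalar, a pattern anchored with ^ and no m can only ever find a url at the very first character.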

                    - Ant
                    - Some of my best work - Fish Dinner

      I use 2 regexps because it was the only way I could see to prevent urls that are already links from being linked again. Also notice that the first regexp is tweaked to preserve line breaks. Here's all the code I'm using for this:

          my $urls = '(http|telnet|gopher|file|wais|ftp|mailto)';
          my $ltrs = '\w';
          my $gunk = '/#~:.?+=&%@!\-';
          my $punc = '.:?\-';
          my $junk = qq~="'>~;   # added
          my $any  = "${ltrs}${gunk}${punc}";

          $text =~ s{([^$junk]\s*\b)($urls:[$any]+?)(?=[$punc]*[^$any]|$)}{$1<a href="$2">$2</a>}igox;
          $text =~ s{^($urls:[$any]+?)(?=[$punc]*[^$any]|$)}{<a href="$1">$1</a>}igox;

      I swear this works on windows, but not our BSD. Any speculation on what could cause that? Thanks..
      Replacing ^ with \b is no good because the text contains links as well as naked urls. My previous reply shows why. You're right that the problem is urls at the beginning of a line in the middle of a string. I've tried both the m and s modifiers with no luck. Like I said, it's baffling.
        Well... if the urls that are already links are always preceded by a > you could do a negative lookbehind for a >... but that isn't the greatest...
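
        A rough sketch of that idea, assuming the urls that are already links are always directly preceded by =, a quote, or > (the same characters as $junk above). It uses a negative lookbehind, which is the assertion that checks what comes before the match. The sample text is made up, and the o modifier is left out only to keep the example small.

```perl
use strict;
use warnings;

my $urls = '(http|telnet|gopher|file|wais|ftp|mailto)';
my $ltrs = '\w';
my $gunk = '/#~:.?+=&%@!\-';
my $punc = '.:?\-';
my $any  = "${ltrs}${gunk}${punc}";

# Hypothetical input: a naked url at the start of the string, one that is
# already a link, and a naked url at the start of a later line.
my $text = join "\n",
    'http://example.com/a starts the string.',
    'Already linked: <a href="http://example.com/b">http://example.com/b</a>',
    'http://example.com/c starts a later line.';

$text =~ s{
    (?<! [="'>] )                     # skip urls that sit inside a link
    \b                                # start at a word boundary
    ( $urls : [$any]+? )              # $1 is the whole url
    (?= [$punc]* (?: [^$any] | $ ) )  # stop before trailing punctuation
}{<a href="$1">$1</a>}gix;

print "$text\n";
```

        Since \b already matches at the start of the string and after a newline, this gets by with one substitution and no ^. The catch is the one already mentioned: a naked url sitting right after the > of some other tag would be skipped too.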

                        - Ant
                        - Some of my best work - Fish Dinner