in reply to url to html regex problem

Why do you have two regexps, one for the beginning of a string and one for anywhere in a string? \b (as shown on that web page) matches a word boundary, which the beginning of a string certainly is when it starts with a word character...

Update: Yes, I just tested the code from the web page on my Linux box; it worked with URLs at the beginning of a line and in the middle of a line...

Update 2: Yours works fine if I replace ^ with \b... but if you are trying to match URLs at the beginning of a line that is not at the beginning of the string, you need the m modifier on your regexp so that ^ will also match after a newline...
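
To make the difference concrete, here is a minimal sketch that counts how many URLs each anchor finds when the second URL starts a line in the middle of the string. The sample text and variable names are made up for illustration and are not from the original code:

```perl
use strict;
use warnings;

# Hypothetical sample: the second URL begins a line that is not the start of the string.
my $text = "http://a.example/ first\nhttp://b.example/ second\n";

# Without /m, ^ matches only at the very beginning of the string.
my $without_m = () = $text =~ /^http:/g;

# With /m, ^ also matches immediately after each embedded newline.
my $with_m = () = $text =~ /^http:/mg;

# \b matches wherever a word character meets a non-word character or the string edge.
my $with_b = () = $text =~ /\bhttp:/g;

print "without /m: $without_m, with /m: $with_m, with \\b: $with_b\n";
# prints "without /m: 1, with /m: 2, with \b: 2"
```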

                - Ant
                - Some of my best work - Fish Dinner

Re: Re: url to html regex problem
by Anonymous Monk on Oct 05, 2001 at 23:18 UTC
    I use 2 regexps because it was the only way I could see to prevent urls that are already links from being linked again. Also notice that the first regexp is tweaked to preserve line breaks. Here's all the code I'm using for this:
    my $urls = '(http|telnet|gopher|file|wais|ftp|mailto)';
    my $ltrs = '\w';
    my $gunk = '/#~:.?+=&%@!\-';
    my $punc = '.:?\-';
    my $junk = qq~="'>~; # added
    my $any = "${ltrs}${gunk}${punc}";
    $text =~ s{([^$junk]\s*\b)($urls:[$any] +?)(?=[$punc]* [^$any]|$) }{$1<a href="$2">$2</a>}igox;
    $text =~ s{^($urls:[$any] +?)(?=[$punc]* [^$any]|$) }{<a href="$1">$1</a>}igox;
    I swear this works on windows, but not our BSD. Any speculation on what could cause that? Thanks..
Re: Re: url to html regex problem
by Anonymous Monk on Oct 05, 2001 at 23:26 UTC
    Replacing ^ with \b is no good because of the presence of links as well as naked URLs in the text. My previous reply shows why. You're right that the problem is URLs at the beginning of a line in the middle of a string. I've tried both the m and s modifiers with no luck. Like I said, it's baffling.
      Well... if urls are always preceded by a > you could do a negative lookahead for a >... but that isn't the greatest...

                      - Ant
                      - Some of my best work - Fish Dinner

        I'd tried negative lookahead before without success. But your suggestion made me try harder and I finally got it to work! I had a problem with the order of the characters in $junk being strangely significant until I switched to single quotes:
        my $junk = '<>]="\'';
        $text =~ s{(?!$junk)\b($urls:[$any] +?)(?=[$punc]* [^$any]|$) }{<a href="$1">$1</a>}migox;
        /me hugs suaveant
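
For anyone reading later: testing the character before a URL is a lookbehind, (?<!...), rather than a lookahead, and a list of characters has to sit inside a bracketed class. As far as I can tell, (?!$junk) interpolates $junk as a literal sequence to be matched ahead, which may be why the order of its characters seemed significant. Below is a minimal sketch of the lookbehind variant in a single pass. It assumes an already-linked URL is always directly preceded by one of = " ' or >; the linkify helper and the sample text are hypothetical, and $urls is switched to a non-capturing group so that $1 is the whole URL:

```perl
use strict;
use warnings;

my $urls = '(?:http|telnet|gopher|file|wais|ftp|mailto)';
my $ltrs = '\w';
my $gunk = '/#~:.?+=&%@!\-';
my $punc = '.:?\-';
my $any  = "${ltrs}${gunk}${punc}";

# Hypothetical helper: wrap naked URLs in anchors, skipping ones that look linked already.
sub linkify {
    my ($text) = @_;
    $text =~ s{
        (?<! [="'>] )                       # not directly after one of the junk characters
        \b                                  # start at a word boundary
        ( $urls : [$any]+? )                # scheme, colon, then as few URL characters as possible
        (?= [$punc]* (?: [^$any] | \z ) )   # stop before trailing punctuation and a non-URL character or the end
    }{<a href="$1">$1</a>}igx;
    return $text;
}

my $in = qq{See <a href="http://a.example/x">http://a.example/x</a>\nhttp://b.example/y. Done};
print linkify($in), "\n";
```

Only the second URL gets wrapped here; the existing link is left alone because both of its copies of the URL follow a " or a >. A naked URL that happens to follow one of those characters would be skipped as well, so this is only as good as the assumption above.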