url to html regex problem

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: url to html regex problem by merlyn (Sage) on Oct 05, 2001 at 23:32 UTC
How about using URI::Find instead? No point re-solving that problem. It even includes an example of what you're trying to do, "Wrap each URI found in an HTML anchor": `my $finder = URI::Find->new( sub { my($uri, $orig_uri) = @_; return qq|<a href="$uri">$orig_uri</a>|; }); $finder->find(\$text);` Although that code is wrong. He needs to HTML-entitize the href parameter. Bleh. Here's the corrected code: `use URI::Find; use CGI qw(escapeHTML); my $finder = URI::Find->new( sub { my($uri, $orig_uri) = @_; $_ = escapeHTML("$_") for $uri, $orig_uri; return qq|<a href="$uri">$orig_uri</a>|; }); $finder->find(\$text);` I just sent the author a bug report. -- Randal L. Schwartz, Perl hacker
Re: url to html regex problem by suaveant (Parson) on Oct 05, 2001 at 22:58 UTC
Why do you have two regexps, one for the beginning of a string and one for anywhere in a string? \b (as shown on that web page) matches a word boundary, which the beginning of a string certainly is... Update: Yes, I just tested the code from the web page on my Linux box, and it worked with urls at the beginning of a line and in the middle of a line... Update 2: Yours works fine if I replace ^ with \b... but if you are trying to match urls at the beginning of a line that is not at the beginning of a string, you need the m modifier on your regexp so the ^ will match after a newline as well... - Ant - Some of my best work - Fish Dinner
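A minimal sketch of the point about the m modifier, using only core Perl. The sample text and the simplified `http:\S+` pattern are assumptions for illustration, not the regexp from the question.

```perl
use strict;
use warnings;

# Hypothetical input: the URL starts a line, but not the string.
my $text = "first line\nhttp://example.com/ on the second line\n";

# Without /m, ^ anchors only at the very start of the string.
my @plain = $text =~ /^(http:\S+)/g;

# With /m, ^ also matches immediately after each embedded newline.
my @multi = $text =~ /^(http:\S+)/mg;

printf "without /m: %d match(es)\n", scalar @plain;
printf "with /m:    %d match(es)\n", scalar @multi;
```

Without /m the URL on the second line is not found; with /m it is. The s modifier only changes what `.` matches, so it has no effect on `^` here.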
Re: Re: url to html regex problem by Anonymous Monk on Oct 05, 2001 at 23:18 UTC
I use 2 regexps because it was the only way I could see to prevent urls that are already links from being linked again. Also notice that the first regexp is tweaked to preserve line breaks. Here's all the code I'm using for this: `my $urls = '(http|telnet|gopher|file|wais|ftp|mailto)'; my $ltrs = '\w'; my $gunk = '/#~:.?+=&%@!\-'; my $punc = '.:?\-'; my $junk = qq~="'>~; # added my $any = "${ltrs}${gunk}${punc}"; $text =~ s{([^$junk]\s\b)($urls:[$any] +?)(?=[$punc] [^$any]|$) }{$1<a href="$2">$2</a>}igox; $text =~ s{^($urls:[$any] +?)(?=[$punc]* [^$any]|$) }{<a href="$1">$1</a>}igox;` I swear this works on Windows, but not on our BSD. Any speculation on what could cause that? Thanks..
Re: Re: url to html regex problem by Anonymous Monk on Oct 05, 2001 at 23:26 UTC
Replacing ^ with \b is no good because of the presence of links as well as naked urls in the text. My previous reply shows why. You're right that the problem is urls at the beginning of a line in the middle of a string. I've tried both the m and s modifiers with no luck. Like I said, it's baffling.
Re: Re: Re: url to html regex problem by suaveant (Parson) on Oct 05, 2001 at 23:30 UTC
Well... if urls that are already links are always preceded by a > you could do a negative lookbehind for a >... but that isn't the greatest... - Ant - Some of my best work - Fish Dinner
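The idea of rejecting a URL that directly follows a > can be sketched with a negative lookbehind, the assertion that inspects the characters before the match. This is only an illustration under stated assumptions: the sample text, the simplified `http://` pattern, and the extra characters in the lookbehind class (`=`, `"`, `'`, borrowed from the `$junk` idea earlier in the thread) are hypothetical, and the href value is not HTML-escaped here.

```perl
use strict;
use warnings;

# Hypothetical input: one URL that is already a link, one naked URL.
my $text = qq{<a href="http://a.example/">http://a.example/</a>\n}
         . qq{http://b.example/ is bare\n};

# Link a URL only when it is NOT immediately preceded by =, ", ' or >.
# The lookbehind consumes nothing, so preceding whitespace and
# newlines are left exactly as they were.
(my $linked = $text) =~
    s{(?<![="'>])\b(http://[^\s<>"']+)}{<a href="$1">$1</a>}g;

print $linked;
```

Because the lookbehind also succeeds at the very start of the string and after a newline, a single substitution covers both cases without needing ^ at all. It would still mislink a URL that follows some other character inside existing markup, which is presumably why this approach "isn't the greatest".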
Re: Re: Re: Re: url to html regex problem by Anonymous Monk on Oct 06, 2001 at 00:53 UTC