keiusui has asked for the wisdom of the Perl Monks concerning the following question:

I have created a forum where users can post messages. Within messages, I want to replace every word that begins with "http://" with an html link.

For example, if the message was:

This site could be useful: http://www.google.com

then I would want the message to be replaced with:

This site could be useful: <a href="http://www.google.com">http://www.google.com</a>

Here is the Perl code that I have so far for the $message variable:

@message_words = split(/ /, $message);
for ($x = 0; $x < @message_words; $x++) {
    if ($message_words[$x] =~ /^http:\/\//is) {
        $message_words[$x] = "<a href=\"$message_words[$x]\">$message_words[$x]</a>";
    }
}
$message = join(' ', @message_words);

The above code breaks up the message into individual words and replaces words that start with "http://" with an html link.

My problem is that if the word follows a line break rather than a space, then the line break is at the beginning of the word and the "http://" is not detected at all.

Can anyone provide better, simpler code than looping through every word?

Just to keep this simple, I am not worrying about special characters within URLs at this time.

Any help would be greatly appreciated. Thank you so much!

Replies are listed 'Best First'.
Re: detecting URLs and turning them into links
by akho (Hermit) on Aug 23, 2007 at 22:33 UTC
    frodo72's suggestion is short, simple and probably good enough. However, if you insist URLs should follow whitespace, this will work better:
    $message =~ s#(?<=\s)(http://\S+)#<a href="$1">$1</a>#g;
    (using lookbehind assertions, see perldoc perlre). This, however, will not replace a URL that starts at the very first character of $message (since it does not follow any whitespace). We can replace that one separately:
    $message =~ s#^(http://\S+)#<a href="$1">$1</a>#;
    $message =~ s#(?<=\s)(http://\S+)#<a href="$1">$1</a>#g;
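The two substitutions above can also be folded into one. A minimal sketch, assuming a negative lookbehind is acceptable: (?<!\S) succeeds both at the very start of the string and right after any whitespace. The helper name linkify is made up for illustration and is not from the thread.

```perl
use strict;
use warnings;

# linkify() is a hypothetical helper name, not something from the thread.
sub linkify {
    my ($message) = @_;
    # (?<!\S) means "not preceded by a non-whitespace character", which is
    # true at the start of the string and after a space, tab or newline.
    $message =~ s{(?<!\S)(http://\S+)}{<a href="$1">$1</a>}g;
    return $message;
}

print linkify("http://www.perlmonks.org\nand http://www.google.com"), "\n";
```

A URL glued to a preceding word (as in "xhttp://...") is left alone, which matches the intent of requiring whitespace before it.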
    URLs are, however, often followed by a punctuation mark, which unfortunately will be included in our link (as in "http://google.com, for example…"). To avoid this, require URLs to end with a letter:
    $message =~ s#^(http://\S+[a-z])#<a href="$1">$1</a>#;
    $message =~ s#(?<=\s)(http://\S+[a-z])#<a href="$1">$1</a>#g;
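Requiring a final [a-z] also cuts off URLs that end in a slash, a digit or an uppercase letter. One possible alternative, sketched here under the assumption that a trailing run of . , ; : ! ? ) before whitespace or the end of the text is never meant as part of the URL, is to match lazily and stop in front of that punctuation. The helper name linkify_trimmed is hypothetical.

```perl
use strict;
use warnings;

# linkify_trimmed() is a hypothetical helper name, not from the thread.
sub linkify_trimmed {
    my ($message) = @_;
    $message =~ s{
        (?<!\S)                      # start of string or after whitespace
        (http://\S+?)                # the URL, matched lazily
        (?=[.,;:!?)]*(?:\s|\z))      # stop before trailing punctuation
    }{<a href="$1">$1</a>}gx;
    return $message;
}

print linkify_trimmed("See http://google.com, for example."), "\n";
```

This is a heuristic: those characters are legal inside a URI, so a URL that really does end in a comma would lose it.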
    I could go on talking about funny characters and special cases, but I'll just stop here and suggest you use Regexp::Common:
    use Regexp::Common qw( URI );
    $message =~ s#^($RE{URI}{HTTP})#<a href="$1">$1</a>#;
    $message =~ s#(?<=\s)($RE{URI}{HTTP})#<a href="$1">$1</a>#g;
      Regexp::Common won't help much for your purpose here:
      $ perl -le '
          use Regexp::Common qw(URI);
          $_ = "http://www.example.com/ciao,";
          s/$RE{URI}{HTTP}/doh!/;
          print
      '
      doh!
      I elaborated a bit about it here.

      Flavio
      perl -ple'$_=reverse' <<<ti.xittelop@oivalf

      Don't fool yourself.
        In practice Regexp::Common works very well for me.

        I maintain the #perl6 irc logs and URLs are automatically linkified using this regex:

        qr/\b$RE{URI}{HTTP}(?:#[\w_%:-]+)?\b/

        and it works fairly well. I guess about 99% of URLs are handled correctly, and half of the 1% of failures are due to my rather naive handling of anchors.

        Though I have to admit that "my" chatters are mostly geeks who paste URLs with leading http:// (and I ignore other URLs).

Re: detecting URLs and turning them into links
by polettix (Vicar) on Aug 23, 2007 at 21:58 UTC
    Use a regex:
    $message =~ s{(http://\S+)} {<a href="$1">$1</a>}mxsg; # UNTESTED
    Now I have to run, I'm already starting to feel that horde of monks coming to bark at me for this ;)

      I recommend against that. URIs are often followed by commas and other punctuation marks in text. Your regexp would include the comma in the URI (like Outlook does/did). It's quite annoying.
        Good that I escaped early, then!

        I only partially agree, and I was a bit too lazy to point all this out in the beginning. The matter of spotting a URI inside a text is quite difficult. But I think that the fact is that many of those characters are valid for a URI, so it's probably an error putting them right after a URI without separating them from the URI itself with a space.

        In particular, according to the standard the full stop is an unreserved character, and the comma and the semicolon are sub-delimiters that play no role in the HTTP scheme. This is why http://www.polettix.it/, and http://www.polettix.it/ciao, are perfectly valid HTTP URIs even with the comma. There is no really correct solution for this: if you keep them you'll annoy your audience most of the time, if you decide to keep them out you're ignoring that they are perfectly valid characters in a URI. I would simply stick to the simplest solution, i.e. the global substitution in this case.

        I also didn't elaborate about why the OP was having difficulties with h(er|is) approach, which was basically due to the fact that splitting on spaces... leaves the newlines. And there are other spacing chars to consider, too, so something along these lines:

        my @parts = split /(\s+)/, $message; # parens preserve spaces
        # operate on @parts as in the OP
        my $filtered = join '', @parts;
        would probably be better. But, again, the problem is there for all those nasty punctuation marks, so there is really no advantage over the global substitution (apart from clarifying things for the OP).
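For completeness, the split-based variant can be sketched as a full routine. This assumes the same "starts with http://" test as in the original post; the helper name linkify_by_split is made up for illustration.

```perl
use strict;
use warnings;

# linkify_by_split() is a hypothetical helper name, not from the thread.
sub linkify_by_split {
    my ($message) = @_;
    # The capturing parentheses make split return the separators too,
    # so every space, tab and newline survives the final join.
    my @parts = split /(\s+)/, $message;
    for my $part (@parts) {
        # $part aliases the array element, so assigning to it edits @parts.
        $part = qq{<a href="$part">$part</a>} if $part =~ m{^http://}i;
    }
    return join '', @parts;
}

print linkify_by_split("This site could be useful:\nhttp://www.google.com"), "\n";
```

Because the whitespace is kept as separate list elements, a URL that follows a line break is now seen as a word of its own.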

Re: detecting URLs and turning them into links
by klekker (Pilgrim) on Aug 24, 2007 at 07:38 UTC
    Another approach and not really a direct answer to your question:
    perhaps you would like to use BBCode or something similar. Then you could offer your users something like
    Have a look at [url]http://www.perlmonks.org[/url] and write to [email]joe@localhost[/email]
    to format their posts. You could try BBCode::Parser. I haven't tried it, but it sounds promising.
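To give an idea of what such markup involves, here is a minimal sketch that handles only the two tags from the example with plain substitutions. It does not use BBCode::Parser (whose interface is not shown in this thread), and render_bbcode is a hypothetical helper name.

```perl
use strict;
use warnings;

# render_bbcode() is a hypothetical helper. It covers only [url] and
# [email], with no nesting and no escaping of the tag contents.
sub render_bbcode {
    my ($text) = @_;
    $text =~ s{\[url\](\S+?)\[/url\]}{<a href="$1">$1</a>}gi;
    $text =~ s{\[email\](\S+?)\[/email\]}{<a href="mailto:$1">$1</a>}gi;
    return $text;
}

# Single quotes keep Perl from interpolating "@localhost" as an array.
print render_bbcode('Have a look at [url]http://www.perlmonks.org[/url] and write to [email]joe@localhost[/email]'), "\n";
```

The explicit tags remove the guesswork about where a URL ends, at the price of asking users to type them.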

    k
Re: detecting URLs and turning them into links
by thezip (Vicar) on Aug 23, 2007 at 22:23 UTC

    The HTML::Manipulator module might be useful for your application. I haven't used it, but it would seem to be able to do what you're looking for.


    Where do you want *them* to go today?
Re: detecting URLs and turning them into links
by ww (Archbishop) on Aug 24, 2007 at 01:45 UTC
    Lots of good ideas above...so here's a question:

    Are your users sophisticated enough to include the http:// consistently?

    I've observed that some (alright, "too many") folks think "www.something.tld" is a valid URL, because "it works in the address thingee."
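If bare "www." addresses should be linked as well, one possible approach is to accept either prefix and add the scheme only to the href. This is a sketch under the assumption that anything starting with "www." after whitespace is meant as a web address; linkify_with_www is a hypothetical helper name.

```perl
use strict;
use warnings;

# linkify_with_www() is a hypothetical helper name, not from the thread.
sub linkify_with_www {
    my ($message) = @_;
    # /e evaluates the replacement as Perl code, so the href can differ
    # from the visible link text.
    $message =~ s{(?<!\S)((?:http://|www\.)\S+)}{
        my $text = $1;
        my $href = $text =~ m{^http://}i ? $text : "http://$text";
        qq{<a href="$href">$text</a>};
    }gei;
    return $message;
}

print linkify_with_www("visit www.example.com or http://www.perlmonks.org"), "\n";
```

The visible text stays exactly as the user typed it; only the link target gets the "http://" prefix.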

Re: detecting URLs and turning them into links
by Anonymous Monk on Aug 24, 2007 at 03:27 UTC