regex to identify http:// in html

Massyn has asked for the wisdom of the Perl Monks concerning the following question:

#!/fellow/monks.pl I've been working on somewhat of an HTML generator from raw text files. One of the things I'd like to do is to translate any http://xxx comments in my text to a proper hyperlink... For example http://www.massyn.net should become <a href="http://www.massyn.net">http://www.massyn.net</a>. I was hoping to do it in the attached code, but it's not working entirely as I hoped. I fell in love with regular expressions, but this one is tricky... I can't get the (.+) to stop at the end... Any assistance is much appreciated.


$text =~ s/http:\/\/(.+)\s/<a href=\"http:\/\/$1\">http:\/\/$1<\/a>/;

Thanks!

     |\/| _. _ _  ._
www. |  |(_|_>_>\/| | .net
                /

The more I learn the more I realise I don't know.
- Albert Einstein


Re: regex to identify http:// in html by merlyn (Sage) on Nov 26, 2005 at 14:38 UTC
"I've been working on somewhat of an HTML generator from raw text files." See URI::Find. I have an example of that in my poor-man's chat program. Also see HTML::FromText, which may do the whole job you want done. -- Randal L. Schwartz, Perl hacker Be sure to read my standard disclaimer if this is a reply.
Re: regex to identify http:// in html by polypompholyx (Chaplain) on Nov 26, 2005 at 14:03 UTC
For a quick hack, `s{(http://\S+?)(\s+)}{<a href="$1">$1</a>$2}` will do what you ask. The important thing to note is the `\S+?`: the trailing `?` makes the quantifier non-greedy, i.e. it'll match the minimum amount required for the regex to succeed, rather than the maximum amount, which is what `\S+` or `.*` would do. I've also used `\S` (any non-space character), as it's best to avoid `.` where you can: see death to dot star.
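To make the greedy versus non-greedy difference concrete, here is a small sketch. The sample sentence and the variable names are made up for illustration; only the patterns come from the thread.

```perl
use strict;
use warnings;

# Hypothetical sample text: a URL followed by more words.
my $text = "See http://www.massyn.net for details and more.";

# Greedy: .+ first runs to the end of the string, then backs off only
# as far as the LAST whitespace, so it swallows most of the sentence.
my ($greedy) = $text =~ m{http://(.+)\s};

# Non-greedy: .+? grows one character at a time and stops at the
# FIRST whitespace that lets the overall match succeed.
my ($lazy) = $text =~ m{http://(.+?)\s};

# \S+? cannot cross whitespace in the first place.
my ($nonspace) = $text =~ m{http://(\S+?)\s};

print "greedy:   $greedy\n";    # www.massyn.net for details and
print "lazy:     $lazy\n";      # www.massyn.net
print "nonspace: $nonspace\n";  # www.massyn.net
```

The greedy capture is what the original `(.+)\s` was doing: it only "stops at the end" of the last space-delimited word, not at the end of the URL.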
Re^2: regex to identify http:// in html by sauoq (Abbot) on Nov 26, 2005 at 14:52 UTC
Your use of a non-greedy quantifier isn't best here. You are already specifying \S and, since you are being specific, the non-greediness isn't really buying you anything. (In fact, it's somewhat less efficient.) You can also skip the capturing of space at the end. You are just re-adding it anyway, so just leave it alone to begin with. Your regex would be better written as: `s!(http://\S+)!<a href="$1">$1</a>!g;` And, you might as well catch https as well: `s!(https?://\S+)!<a href="$1">$1</a>!g;` -sauoq "My two cents aren't worth a dime.";
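As a rough sketch of that last substitution in use: the `linkify` helper name and the sample text below are assumptions for illustration, not something from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap every http:// or https:// URL in an anchor tag.
sub linkify {
    my ($text) = @_;
    # \S+ is greedy but can never cross whitespace, so each match ends
    # at the next space or at the end of the string; /g repeats the
    # substitution for every URL in the text.
    $text =~ s!(https?://\S+)!<a href="$1">$1</a>!g;
    return $text;
}

my $in  = "Visit http://www.massyn.net or https://www.perlmonks.org";
my $out = linkify($in);
print "$out\n";
```

Because nothing is required after `\S+`, the second URL is still matched even though it sits at the very end of the string with no trailing whitespace. One caveat of this approach is that `\S+` will also pick up punctuation glued to the URL, such as a closing parenthesis or a full stop.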
Re: regex to identify http:// in html by Samy_rio (Vicar) on Nov 26, 2005 at 14:18 UTC
Hi Massyn, Try this, `my $str="McGlaughlin http://www.karayiannis.com and http://www.samy.com"; $str =~ s/\b((?:http\:\/\/)|(?:www\.))([^ ]+)/<a href=\"$&\">$&<\/a>/sgi; print $str; __END__ McGlaughlin <a href="http://www.karayiannis.com">http://www.karayiannis.com</a> and <a href="http://www.samy.com">http://www.samy.com</a>` Regards, Velusamy R. eval"print uc\"\\c$_\""for split'','j)@,/6%@0%2,`e@3!-9v2)/@|6%,53!-9@2~j';