Re: URLs in plain text

Regexes can parse many forms of URLs, including the most common ones. Here's a regex for HTTP URIs:

(?:(?:http)://(?:(?:(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]*)?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]*[a-zA-Z0-9]|[a-zA-Z])[.]?)|(?:[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+)))(?::(?:(?:[0-9]*)))?(?:/(?:(?:(?:(?:(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)(?:;(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*)(?:/(?:(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)(?:;(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*))*))(?:[?](?:(?:(?:[;/?:@&=+$,a-zA-Z0-9\-_.!~*'()]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)))?))?)
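For anyone who finds the pattern above hard to read, here is a rough sketch of how it might be assembled from named pieces. The variable names ($escape, $pchars, $segment and so on) and the sample text are my own and do not come from the post. The redundant clusters have been dropped, so this is meant to mirror the structure (host, optional port, optional path, optional query), not to be a character-for-character copy.

```perl
use strict;
use warnings;

# Pieces are kept in non-interpolating q{} strings so that "$," and "@&"
# inside the character classes are taken literally.
# All names below are illustrative only.
my $escape   = q{%[a-fA-F0-9][a-fA-F0-9]};
my $pchars   = q{[a-zA-Z0-9\-_.!~*'():@&=+$,]+};
my $qchars   = q{[;/?:@&=+$,a-zA-Z0-9\-_.!~*'()]+};
my $label    = q{(?:[a-zA-Z0-9][-a-zA-Z0-9]*)?[a-zA-Z0-9]};
my $toplabel = q{[a-zA-Z][-a-zA-Z0-9]*[a-zA-Z0-9]|[a-zA-Z]};
my $ipv4     = q{[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+};

# A path segment is a parameter list separated by ";";
# a path is a list of segments separated by "/".
my $param    = '(?:' . $pchars . '|' . $escape . ')*';
my $segment  = $param . '(?:;' . $param . ')*';
my $path     = $segment . '(?:/' . $segment . ')*';
my $query    = '(?:' . $qchars . '|' . $escape . ')*';

# Host is either a dotted hostname or four dotted groups of digits.
my $hostname = '(?:(?:' . $label . ')[.])*(?:' . $toplabel . ')[.]?';
my $host     = '(?:' . $hostname . '|' . $ipv4 . ')';
my $port     = '(?::[0-9]*)?';

my $http = qr{http://$host$port(?:/$path(?:[?]$query)?)?};

# Sample text made up for this sketch.
my $text  = 'See http://www.example.com:8080/a/b;x=1?q=perl%20regex '
          . 'and http://10.0.0.1/ for details.';
my @found = $text =~ /($http)/g;
print "$_\n" for @found;
```

Every group is still (?:...), so nothing is captured except the outer parentheses added in the final match.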

Alternatively, you may want to use the Regexp::Common module:

use Regexp::Common;
print $&, "\n" while $txt =~ /$RE{URI}/g;

Abigail

Re: Re: URLs in plain text by Purdy (Hermit) on Nov 11, 2003 at 16:34 UTC
Oh, my bleeding eyes!! ;) Perhaps an entry for some Obfuscation - lots of nested lookaheads and that's where I got lost. $RE{'URI'} for me!
Re: URLs in plain text by Abigail-II (Bishop) on Nov 11, 2003 at 16:56 UTC
There shouldn't be any lookaheads in that regexp.

Abigail
Re: Re: URLs in plain text by Purdy (Hermit) on Nov 11, 2003 at 17:14 UTC
Oh ... you're right, of course. In my limited regexp experience (not that I don't use regexps a lot, but when I do, they're usually very simple things), I used the `(?!pattern)` as a way to tell if the `src` attribute values in `IMG` tags started with `http://` or not, so I could do some relative path stuff if necessary. In that experience, I labelled anything that started with `(?...)` as a lookahead.

Peace,
Jason

PS: Nested clustering, then? ;)
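
To make the distinction concrete, here is a small sketch that contrasts the two constructs on `src`-style values. The `classify` helper and the sample values are hypothetical and written only for this illustration. They are not the code referred to above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (is_relative, path) for a src value.
sub classify {
    my ($src) = @_;

    # (?!...) is a negative lookahead: it asserts that "http://" does NOT
    # follow at this position and consumes no characters.
    my $relative = $src =~ m{^(?!http://)} ? 1 : 0;

    # (?:...) is a non-capturing cluster: it groups like (...) but does
    # not set $1, so only the trailing (.*) is captured.
    my ($path) = $src =~ m{^(?:http://[^/]+)?/?(.*)$};

    return ($relative, $path);
}

# Sample values made up for this sketch.
for my $src ('http://example.com/logo.png', 'images/logo.png') {
    my ($relative, $path) = classify($src);
    print "$src: relative=$relative path=$path\n";
}
```

Both start with `(?`, but only the lookahead is an assertion. The cluster is plain grouping without capturing.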