in reply to URLs in plain text

Regexes can parse many forms of URLs, including the most common ones. Here's a regex for HTTP URIs:
(?:(?:http)://(?:(?:(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]*)?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]*[a-zA-Z0-9]|[a-zA-Z])[.]?)|(?:[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+)))(?::(?:(?:[0-9]*)))?(?:/(?:(?:(?:(?:(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)(?:;(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*)(?:/(?:(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)(?:;(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*))*))(?:[?](?:(?:(?:[;/?:@&=+$,a-zA-Z0-9\-_.!~*'()]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)))?))?)
Alternatively, you may want to use the Regexp::Common module:
use Regexp::Common;
print $&, "\n" while $txt =~ /$RE{URI}/g;

Abigail
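
For anyone who wants just the HTTP pattern rather than every supported scheme, here is a minimal sketch using $RE{URI}{HTTP} with the -keep flag. It assumes Regexp::Common is installed from CPAN; the sample text and its URLs are made-up placeholders, and the @found array is only for illustration.

```perl
use strict;
use warnings;
use Regexp::Common qw(URI);

# Hypothetical sample text; the URLs are placeholders.
my $txt = 'See http://www.example.com/docs/index.html?x=1 '
        . 'and http://example.org:8080/ for details.';

my @found;
while ($txt =~ /$RE{URI}{HTTP}{-keep}/g) {
    # With -keep, $1 holds the whole URI and $3 holds the host.
    push @found, { uri => $1, host => $3 };
}

print "$_->{uri} (host: $_->{host})\n" for @found;
```

One thing to watch for: the pattern accepts "." and "," as path characters, so a URL that ends a sentence will usually pick up the trailing punctuation as well.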

Re: Re: URLs in plain text
by Purdy (Hermit) on Nov 11, 2003 at 16:34 UTC
    Oh, my bleeding eyes!! ;) Perhaps an entry for some Obfuscation - lots of nested lookaheads and that's where I got lost.

    $RE{'URI'} for me!

      There shouldn't be any lookaheads in that regexp.

      Abigail

        Oh ... you're right, of course. In my limited regexp experience (not that I don't use regexps a lot, but when I do, they're usually very simple things), I used the (?!pattern) as a way to tell if the src attribute values in IMG tags started with http:// or not, so I could do some relative path stuff if necessary. In that experience, I labelled anything that started with (?...) as a lookahead.

        Peace,

        Jason

        PS: Nested clustering, then? ;)
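
To make the distinction in this sub-thread concrete, here is a small sketch in core Perl. (?:...) is a non-capturing cluster, while (?!...) and (?=...) are zero-width lookaheads. The src values and the file name are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical src attribute values.
my @srcs = ('http://example.com/a.png', 'images/b.png');

# (?!...) is a negative lookahead: it asserts that "http://" does not
# follow at this position, and it consumes no characters.
my @relative = grep { m{^(?!http://)} } @srcs;

# (?:...) only groups: it matches and consumes text like (...),
# but it does not create a capture.
my $name = 'photo.jpeg';
my @caps_cluster = $name =~ /^(\w+)\.(?:png|jpe?g)$/;   # one capture
my @caps_group   = $name =~ /^(\w+)\.(png|jpe?g)$/;     # two captures

# (?=...) is a positive lookahead: the text it checks is still there
# for the rest of the pattern to match.
my ($scheme) = 'http://example.com/' =~ /^(?=http)(\w+)/;

print "relative: @relative\n";
print "cluster captures: @caps_cluster\n";
print "group captures: @caps_group\n";
print "scheme: $scheme\n";
```

So the long regexp above is indeed nested clustering: every (?: in it is there for grouping and alternation, and none of them looks ahead.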