URLs in plain text

traveler has asked for the wisdom of the Perl Monks concerning the following question:

Re: URLs in plain text by jdtoronto (Prior) on Nov 11, 2003 at 16:34 UTC
And then there is the module that does it all (I am a regex scaredy cat!): URI::Find. This module does one thing: finds URIs and URLs in plain text. It finds them quickly and it finds them all (or what URI::URL considers a URI to be). It only finds URIs which include a scheme (http:// or the like); for something a bit less strict, have a look at URI::Find::Schemeless. jdtoronto
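For anyone who wants to see the shape of it, here is a minimal sketch of URI::Find in use. The sample text and the variable names (`$text`, `@found`, `$finder`) are made up for illustration. The callback receives a URI object and the original matched text, and whatever it returns replaces the match in the string, so returning the original text leaves the input unchanged.

```perl
use strict;
use warnings;
use URI::Find;

# Hypothetical sample text; any plain-text string works here.
my $text = "See http://www.perlmonks.org/ and "
         . "http://search.cpan.org/dist/URI-Find/ for details.";

my @found;

# The callback is invoked once per URI found.
my $finder = URI::Find->new(sub {
    my ($uri, $orig_text) = @_;
    push @found, $orig_text;   # remember the URI as it appeared in the text
    return $orig_text;         # put the same text back, i.e. change nothing
});

# find() takes a reference to the text and returns the number of URIs found.
my $count = $finder->find(\$text);

print "$count URI(s) found\n";
print "$_\n" for @found;
```

Returning something other than `$orig_text` from the callback (an HTML link, say) rewrites the text in place, which is why `find()` wants a reference.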
Re: URLs in plain text by Abigail-II (Bishop) on Nov 11, 2003 at 16:24 UTC
Regexes can parse many forms of URLs, including the most common ones. Here's a regex for HTTP URIs:
`(?:(?:http)://(?:(?:(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]*)?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]*[a-zA-Z0-9]|[a-zA-Z])[.]?)|(?:[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+)))(?::(?:(?:[0-9]*)))?(?:/(?:(?:(?:(?:(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)(?:;(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*)(?:/(?:(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)(?:;(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*))*))(?:[?](?:(?:(?:[;/?:@&=+$,a-zA-Z0-9\-_.!~*'()]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)))?))?)`
Alternatively, you may want to use the Regexp::Common module: `use Regexp::Common; print $&, "\n" while $txt =~ /$RE{URI}/g;` Abigail
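A slightly longer sketch of the Regexp::Common route, restricted to HTTP URIs and using a capture group instead of `$&`. The sample text and the names `$txt` and `@urls` are assumptions for illustration only. Note that `$RE{URI}{HTTP}` matches the `http` scheme only unless told otherwise, and that it will happily swallow a trailing period, since `.` is a legal URI character.

```perl
use strict;
use warnings;
use Regexp::Common qw(URI);

# Hypothetical sample text containing two HTTP URIs.
my $txt = "Docs at http://www.perl.org/docs.html and a local mirror "
        . "at http://127.0.0.1:8080/index.html here.";

my @urls;

# Walk the string with /g, capturing each HTTP URI in $1.
while ($txt =~ /($RE{URI}{HTTP})/g) {
    push @urls, $1;
}

print "$_\n" for @urls;
```

Capturing into `$1` avoids `$&`, which on older perls slows down every regex match in the program once it has been seen.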
Re: Re: URLs in plain text by Purdy (Hermit) on Nov 11, 2003 at 16:34 UTC
Oh, my bleeding eyes!! ;) Perhaps an entry for some Obfuscation: lots of nested lookaheads, and that's where I got lost. $RE{'URI'} for me!
Re: URLs in plain text by Abigail-II (Bishop) on Nov 11, 2003 at 16:56 UTC
There shouldn't be any lookaheads in that regexp. Abigail
Re: Re: URLs in plain text by Purdy (Hermit) on Nov 11, 2003 at 17:14 UTC
Re: URLs in plain text by batkins (Chaplain) on Nov 11, 2003 at 16:39 UTC
URI::Find works for me. Are you sure it was a book? Are you sure it wasn't.....nothing?
Re: URLs in plain text by gjb (Vicar) on Nov 11, 2003 at 16:26 UTC
Have a look at Regexp::Common; it has a number of expressions to extract URLs. Hope this helps, -gjb-