traveler has asked for the wisdom of the Perl Monks concerning the following question:

I strongly believe this has been done before, but I can't find any module on CPAN or code here that does it (I probably searched on the wrong terms). I want to search plain text (not HTML) and extract all the URLs. It has been correctly said here many times that a regex is not good enough to parse URLs. True. So is there a module that works with plain text rather than HTML?

Replies are listed 'Best First'.
Re: URLs in plain text
by jdtoronto (Prior) on Nov 11, 2003 at 16:34 UTC
    And then there is the module that does it all - I am a regex scaredy cat!

    URI::Find: This module does one thing: it finds URIs and URLs in plain text. It finds them quickly and it finds them all (or what URI::URL considers a URI to be). It only finds URIs which include a scheme (http:// or the like); for something a bit less strict, have a look at URI::Find::Schemeless.
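
    A minimal sketch of how URI::Find is typically used, assuming the module is installed from CPAN. The sample text, the example.com/example.org addresses, and the variable names are made up for illustration. The callback receives a URI object and the original matched text, and whatever it returns replaces the match, so returning the original text leaves the string untouched.

```perl
use strict;
use warnings;
use URI::Find;

# Hypothetical sample text; the addresses are placeholders.
my $text = 'Docs live at http://www.example.com/docs and ftp://ftp.example.org/pub today.';

my @found;
my $finder = URI::Find->new(sub {
    my ($uri, $original_text) = @_;
    push @found, "$uri";      # stringify the URI object
    return $original_text;    # put the matched text back unchanged
});

# find() takes a reference to the text and returns the number of URIs seen.
my $count = $finder->find(\$text);

print "$count URI(s) found\n";
print "$_\n" for @found;
```

    URI::Find::Schemeless is a subclass with the same interface, so swapping the class name should be enough to also try addresses written without a scheme.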

    jdtoronto

Re: URLs in plain text
by Abigail-II (Bishop) on Nov 11, 2003 at 16:24 UTC
    Regexes can parse many forms of URLs, including the most common ones. Here's a regex for HTTP URIs:
    (?:(?:http)://(?:(?:(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]*)?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]*[a-zA-Z0-9]|[a-zA-Z])[.]?)|(?:[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+)))(?::(?:(?:[0-9]*)))?(?:/(?:(?:(?:(?:(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)(?:;(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*)(?:/(?:(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)(?:;(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*))*))(?:[?](?:(?:(?:[;/?:@&=+$,a-zA-Z0-9\-_.!~*'()]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)))?))?)
    Alternatively, you may want to use the Regexp::Common module:
    use Regexp::Common;
    print $&, "\n" while $txt =~ /$RE{URI}/g;
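
    A slightly fuller sketch of the Regexp::Common approach, assuming a reasonably recent version of the module that supports the -scheme option for $RE{URI}{HTTP}. By default that pattern only accepts the http scheme, so the option is used here to accept https as well. The sample text and addresses are placeholders.

```perl
use strict;
use warnings;
use Regexp::Common qw(URI);

# Hypothetical sample text; the addresses are placeholders.
my $text = 'See http://www.example.com/index.html or '
         . 'https://secure.example.org/login?user=me for details.';

# Build the pattern once; -scheme widens it from "http" to "http or https".
my $http_re = $RE{URI}{HTTP}{-scheme => 'https?'};

# A global match in list context returns every captured URL.
my @urls = $text =~ /($http_re)/g;

print "$_\n" for @urls;
```

    Because the pattern follows the URI grammar, characters such as '.' and ',' are legal inside a path, so punctuation that directly follows a URL in running text may be swallowed into the match.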

    Abigail

      Oh, my bleeding eyes!! ;) Perhaps an entry for some Obfuscation - lots of nested lookaheads and that's where I got lost.

      $RE{'URI'} for me!

        There shouldn't be any lookaheads in that regexp.

        Abigail
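
        For readers who trip over the same thing: the (?:...) construct that fills the long regex is a non-capturing group, whereas a lookahead is written (?=...). A small illustration of the difference, using a made-up string:

```perl
use strict;
use warnings;

my $text = 'foobar';

# (?:...) only groups; the text it matches is consumed as part of the match.
my ($grouped) = $text =~ /(foo(?:bar))/;

# (?=...) is a lookahead; it checks that "bar" follows without consuming it.
my ($looked) = $text =~ /(foo(?=bar))/;

print "non-capturing group matched: $grouped\n";
print "lookahead matched: $looked\n";
```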

Re: URLs in plain text
by batkins (Chaplain) on Nov 11, 2003 at 16:39 UTC
    URI::Find works for me.
Re: URLs in plain text
by gjb (Vicar) on Nov 11, 2003 at 16:26 UTC

    Have a look at Regexp::Common; it provides a number of expressions for extracting URLs.

    Hope this helps, -gjb-