Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I'm using the following regular expression (borrowed from the Perl Cookbook) to extract links from HTML:
@links = m/<A[^>]+?HREF\s*=\s*["']?([^'" >]+?)[ '"]?>/sig;
I am wondering if anyone has a suggestion on how to extract a list of links from a block of text without any HTML tags, e.g.
"this is a link, http://bob.com/page"
and
"so is this, but with a query string and a trailing period, http://bob.com/page?elem=val&elem2=val2.".

I'm reading through the perlre documentation, but I'm not very good yet, so any suggestions would be helpful.
Thanks!

Re: extracting links from text
by damian1301 (Curate) on Sep 04, 2001 at 21:59 UTC
    You might want to check out URI::Find and HTML::LinkExtor.
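    URI::Find is the one for text with no tags: it scans a string for things that look like URIs and calls you back for each one, and it already tries to be smart about trailing punctuation. A minimal sketch (the sample string and variable names are mine; the callback receives the URI object and the original matched text, and its return value replaces that text in the string):

        use URI::Find;

        my $text = "so is this, but with a query string and a trailing period, "
                 . "http://bob.com/page?elem=val&elem2=val2.";

        my @links;
        my $finder = URI::Find->new(sub {
            my ($uri, $orig_text) = @_;   # $uri is a URI object
            push @links, $uri;
            return $orig_text;            # put the matched text back unchanged
        });
        $finder->find(\$text);            # returns the number of URIs found

        print "$_\n" for @links;          # should print http://bob.com/page?elem=val&elem2=val2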

    _______________________________________________________
    s&&q+\+blah+&oe&&s<hlab>}map{s}&&q+}}+;;s;\+;;g;y;ahlb;veaD;&&print;
Re: extracting links from text
by mexnix (Pilgrim) on Sep 04, 2001 at 21:59 UTC
      These all seem useful for parsing links out of HTML, but I haven't figured out how to use these modules on a text document without any HTML tags.

      Hmmmm. Maybe I'm just being dense, and should try a diet of "Learning Perl".
Re: extracting links from text
by Cine (Friar) on Sep 04, 2001 at 22:06 UTC
    Gnome-terminal (which I know does this) uses the following scheme:

    If something starts with www or http://, it is recognised as a link up until the next character that is illegal in a link (newlines are allowed). However, if the last character is a . or a , it is ignored.
    In Perl, something like:
    # the character class approximates "anything allowed in a link", newline included
    my @links = m{((?:www|http://)[\w\-.~:/?#\[\]\@!\$&'()*+,;=%\n]*)}ig;
    for (0 .. $#links) {
        $links[$_] =~ s/[.,]$//;    # drop a trailing . or ,
        $links[$_] =~ tr/\n//d;     # remove inline newlines
    }
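    Against the OP's second sample string, for instance (the character class is just my guess at "anything allowed in a link"):

        my $text = "so is this, but with a query string and a trailing period, "
                 . "http://bob.com/page?elem=val&elem2=val2.";
        my @links = $text =~ m{((?:www|http://)[\w\-.~:/?#\[\]\@!\$&'()*+,;=%\n]*)}ig;
        s/[.,]$//, tr/\n//d for @links;    # same cleanup as above
        print "$_\n" for @links;           # http://bob.com/page?elem=val&elem2=val2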


    T I M T O W T D I