PerlMonks  

extracting links from text

by Anonymous Monk
on Sep 04, 2001 at 21:46 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I'm using the following regular expression (borrowed from the Perl Cookbook) to extract links from HTML:
@links = m/<A[^>]+?HREF\s*=\s*["']?([^'" >]+?)[ '"]?>/sig;
I am wondering if anyone has a suggestion as to how I would extract a list of links from a block of text without any HTML tags, e.g.
"this is a link, http://bob.com/page"
and
"so is this, but with a query string and a trailing period, http://bob.com/page?elem=val&elem2=val2.".

I'm reading through the perlre documentation, but I'm not very good at regular expressions yet, so any suggestions would be helpful.
Thanks!

Replies are listed 'Best First'.
Re: extracting links from text
by damian1301 (Curate) on Sep 04, 2001 at 21:59 UTC
    You might want to check out URI::Find and HTML::LinkExtor.
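
    For plain text with no HTML tags, URI::Find is the one of the two that applies. Here is a minimal sketch, assuming the module is installed from CPAN; the sample text is taken from the question, and the variable names are only illustrative.

```perl
use strict;
use warnings;
use URI::Find;

my $text = <<'END';
this is a link, http://bob.com/page
so is this, but with a query string and a trailing period, http://bob.com/page?elem=val&elem2=val2.
END

my @links;

# The callback runs once per URI found. Its return value replaces the
# matched text, so returning the original text leaves $text unchanged.
my $finder = URI::Find->new(sub {
    my ($uri, $orig_text) = @_;
    push @links, "$uri";
    return $orig_text;
});

# find() takes a reference to the text and returns the number of URIs found.
my $count = $finder->find(\$text);

print "$_\n" for @links;
```

    The module does its own trimming of trailing sentence punctuation, so the period after the second link should not end up in the result.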

    _______________________________________________________
    s&&q+\+blah+&oe&&s<hlab>}map{s}&&q+}}+;;s;\+;;g;y;ahlb;veaD;&&print;
Re: extracting links from text
by mexnix (Pilgrim) on Sep 04, 2001 at 21:59 UTC
      These all seem useful if I'm looking to parse links out of HTML, but I haven't figured out how to do this in a text document without any HTML tags using these modules.

      Hmmmm. Maybe I've been eating too many retard sandwiches, and should try a diet of "Learning Perl".
Re: extracting links from text
by Cine (Friar) on Sep 04, 2001 at 22:06 UTC
    gnome-terminal (which I know does this) uses the following scheme:

    Anything that starts with www or http:// is recognised as a link, running up to the next character that is illegal in a link; newlines are allowed inside the link. If the last character is a . or a , that character is dropped.
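    In Perl, something like:
    # Assumed set of characters allowed in a link (RFC 3986 characters plus %), plus newline.
    my $url_char = qr{[\w.~:/?#\[\]\@!\$&'()*+,;=%\n-]};
    @links = m/((?:www|http:\/\/)$url_char*)/ig;
    for (0..$#links) {
        $links[$_] =~ tr/\n//d;      # Remove inline newlines.
        $links[$_] =~ s/[.,]$//;     # Drop a trailing . or ,
    }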
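
    The scheme described above can be sketched as a small self-contained sub and run against the two strings from the question. This is only a sketch under a few assumptions: `extract_links` is a hypothetical helper name, the character class is my reading of "anything allowed in a link" (the RFC 3986 characters plus %), the www prefix is tightened to require a dot, and newlines end a link rather than being joined, since the input here is a text file and not a wrapped terminal line.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every URL-like token found in a plain-text string.
sub extract_links {
    my ($text) = @_;

    # Assumed set of characters legal in a link (RFC 3986 reserved and
    # unreserved characters plus '%'). Whitespace, including newline, ends the link.
    my $url_char = qr{[\w.~:/?#\[\]\@!\$&'()*+,;=%-]};

    # A link starts with "www." or "http://".
    my @links = $text =~ m{((?:http://|www\.)$url_char+)}ig;

    # Drop trailing periods and commas, which are usually sentence punctuation.
    s/[.,]+$// for @links;

    return @links;
}

my @found = extract_links(
    "this is a link, http://bob.com/page\n"
  . "so is this, http://bob.com/page?elem=val&elem2=val2."
);
print "$_\n" for @found;
```

    For that input the sub returns http://bob.com/page and http://bob.com/page?elem=val&elem2=val2, with the trailing period removed from the second one. A link that legitimately ends in a period or comma would be truncated, which is the trade-off this scheme accepts.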


    T I M T O W T D I
