drake50 has asked for the wisdom of the Perl Monks concerning the following question:

I've been working on a program to extract links from a web page, something I thought was going to be a very simple process. The problem is that I'm not sure what the best way to do this is. Is it better to use something like:

    my $stream = HTML::TokeParser->new( \$document );

and then pull out everything of type href... Or is it better to use a regex on the page, like

    while ($document =~ m/href\s*=\s*"*([^"\s]+)"*\s*>/gi) {

or

    while( $document =~ m/<a href=\"(.*?)\"/ig ) {

Or is there something totally different I should be looking at?
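For reference, a minimal sketch of the HTML::TokeParser approach described above (illustrative only; it assumes $document already holds the fetched HTML):

    use HTML::TokeParser;

    # $document is assumed to hold the page's HTML, e.g. fetched with LWP::Simple's get()
    my $stream = HTML::TokeParser->new( \$document );

    # get_tag('a') skips ahead to the next <a> start tag;
    # the token's second element is the attribute hash
    while ( my $token = $stream->get_tag('a') ) {
        my $href = $token->[1]{href};
        print "$href\n" if defined $href;
    }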

Re: extracting web links
by Corion (Patriarch) on Dec 27, 2003 at 19:03 UTC

    Both HTML::LinkExtor and HTML::LinkExtractor are very helpful and are the two modules you should look at :-). They are based on HTML::Parser, but take care of the nitty-gritty details for you.
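    To make that concrete, here is a minimal HTML::LinkExtor sketch (illustrative, not from the original post; it assumes $document holds the page's HTML):

        use HTML::LinkExtor;

        my @links;

        # The callback is invoked for every link-bearing tag the parser finds;
        # here we keep only href attributes from anchor tags.
        my $parser = HTML::LinkExtor->new( sub {
            my ( $tag, %attrs ) = @_;
            push @links, $attrs{href} if $tag eq 'a' and defined $attrs{href};
        } );

        $parser->parse($document);
        $parser->eof;

        print "$_\n" for @links;

    Passing a base URL as the second argument to new() makes HTML::LinkExtor absolutize relative links.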

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The
    $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
    ($c = $d->accept())->get_request(); $c->send_response( new #in the
    HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
      In addition to the two modules that Corion listed, there's also HTML::SimpleLinkExtor, which provides a simpler interface on top of HTML::LinkExtor. From the description in the POD:
      This is a simple HTML link extractor designed for the person who does not want to deal with the intricacies of HTML::Parser or the de-referencing needed to get links out of HTML::LinkExtor.
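      For example, a small usage sketch (illustrative only; it assumes $document already holds the HTML):

          use HTML::SimpleLinkExtor;

          my $extor = HTML::SimpleLinkExtor->new();
          $extor->parse($document);      # there is also parse_file() for files on disk

          my @all_links = $extor->links; # every link-ish attribute (href, src, ...)
          my @hrefs     = $extor->a;     # just the values from <a href="..."> tags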
      I've noticed that my regular expression version finds more links than when I use HTML::Parser. Since HTML::LinkExtor and HTML::LinkExtractor are based on HTML::Parser, can I assume that they aren't going to pick up more links than HTML::Parser does?

        All of these modules will only find links in anchor tags or image tags. Your regular expression doesn't look very sound to me, so I doubt that it will find more links, but it will surely find different links, as it will more or less gobble up anything that remotely looks like a link in double quotes, while leaving out links in single quotes.

        I'm not sure about your requirements, but for me, any of these modules has always been enough. If you have special requirements as to the nature of links extracted, please state them more specifically and if possible, with examples.

        Probably more, but what about links like

        http://user:passwd@site

        That becomes especially interesting when you consider that the user or passwd could contain a space (or a quote?), that a URL can contain a comment like "a > b ?", that quotes aren't mandatory, and finally that a regex can be a memory/CPU hog. If you still feel like using regexes, test things like
        perl -Mre=debug -e '" "=~/href\s*=\s*"*([^"\s]+)"*\s*>/gi'
        (put your URL in the quoted string) and you'll have a better idea of what the regex engine does for you. After that, maybe your regex will be the solution ... but I doubt it.
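        As a contrived illustration of that first point (my example, not from the original post): a single space inside the credentials is enough to make the posted regex miss the link entirely, because its character class stops at whitespace while the pattern still insists on a closing > right after it:

            my $html = q{<a href="http://user:my pass@example.com/">login</a>};

            while ( $html =~ m/href\s*=\s*"*([^"\s]+)"*\s*>/gi ) {
                print "captured: $1\n";   # never reached -- the match fails outright
            }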
        --
        dominix
Re: extracting web links
by revdiablo (Prior) on Dec 27, 2003 at 23:10 UTC

    Since you are extracting links from HTML, you should probably use HTML::LinkExtor as recommended, but it might also be useful to point out URI::Find, which is good for extracting links from arbitrary text. It doesn't try to systematically parse the text; it just looks for anything resembling a URI. Pretty handy.
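    A minimal URI::Find sketch (illustrative, not from the original post), collecting anything that looks like a URI from a chunk of plain text:

        use URI::Find;

        my $text = 'Docs at http://www.perl.org/ and mirrors at ftp://ftp.cpan.org/ too.';

        my @found;

        # The callback gets the URI object and the original text it matched;
        # returning the original text leaves the string untouched.
        my $finder = URI::Find->new( sub {
            my ( $uri, $orig_text ) = @_;
            push @found, $uri;
            return $orig_text;
        } );

        my $count = $finder->find(\$text);   # find() takes a reference to the text

        print "$_\n" for @found;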