wojtyk has asked for the wisdom of the Perl Monks concerning the following question:

Does anyone know the best way to extract all links of the following format from an HTML page?

<a href = ...><img src= ...></img></a>

Essentially, I want all links that have an image as the "text" of an href. The crucial thing, however, is that I need to know the src location of that image. All the CPAN modules I've tried seem to toss that data away when they "textify" the link (the important img data inside the href is replaced with a useless "[IMG]" tag).

I've tinkered/experimented with everything from WWW::Mechanize to HTML::LinkExtor to HTML::Tree to HTML::Parser to at least a dozen other things. I can't get anything to work right that doesn't textify first.

I really didn't want to homegrow a regex, but I'm running out of options.

Does anyone know the best way to do this?

Re: Extracting full links from HTML
by GrandFather (Saint) on Feb 02, 2007 at 10:26 UTC
    use warnings;
    use strict;

    use HTML::TreeBuilder;

    my $str = "<a href ='some\\where'><img src='apple.gif'></img></a>";

    my $root = HTML::TreeBuilder->new_from_content($str);

    for ($root->look_down('_tag', 'a')) {
        next if !$_->look_down('_tag', 'img');
        print $_->as_HTML();
    }

    which prints:

    <a href="some\where"><img src="apple.gif"></a>

    should get you started. Take a look at HTML::TreeBuilder and HTML::Element.
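
    Since the OP wants the src value itself, a small extension of the above (an untested sketch using the same HTML::Element accessors) pulls the attributes out directly instead of printing the whole element:

    use warnings;
    use strict;

    use HTML::TreeBuilder;

    my $str  = "<a href ='some\\where'><img src='apple.gif'></img></a>";
    my $root = HTML::TreeBuilder->new_from_content($str);

    for my $anchor ($root->look_down('_tag', 'a')) {
        # skip anchors that don't contain an image
        my $img = $anchor->look_down('_tag', 'img') or next;
        # attr() returns an attribute's value, or undef if absent
        print $anchor->attr('href'), " => ", $img->attr('src'), "\n";
    }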


    DWIM is Perl's answer to Gödel
Re: Extracting full links from HTML
by wfsp (Abbot) on Feb 02, 2007 at 10:53 UTC
    I agree, you really don't want a regex. :-)

    Here's my go using HTML::TokeParser::Simple.

    #!C:/Perl/bin/perl.exe
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my $html = q{
    <a href="some.html"><img src="an_image.jpg"/></a>
    };

    my $p = HTML::TokeParser::Simple->new(\$html);

    my $in_anchor;
    while (my $t = $p->get_token){
        if ($t->is_start_tag('a')){
            $in_anchor++;
            next;
        }
        if ($t->is_start_tag('img') and $in_anchor){
            my $src = $t->get_attr('src');
            print "$src\n";
            $in_anchor = 0;
        }
    }
    output:
    an_image.jpg
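
    One caveat with the flag approach: $in_anchor is never cleared on </a>, so an image that merely follows an empty link would still be reported. A variant (an untested sketch, same HTML::TokeParser::Simple API) that watches the end tag and captures the href as well:

    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my $html = q{<a href="some.html"><img src="an_image.jpg"/></a>};

    my $p = HTML::TokeParser::Simple->new(\$html);

    my $href;
    while (my $t = $p->get_token){
        if ($t->is_start_tag('a')){
            $href = $t->get_attr('href');   # remember the enclosing link
            next;
        }
        if ($t->is_end_tag('a')){
            undef $href;                    # left the anchor without seeing an img
            next;
        }
        if ($t->is_start_tag('img') and defined $href){
            print "$href => ", $t->get_attr('src'), "\n";
            undef $href;
        }
    }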
Re: Extracting full links from HTML
by smahesh (Pilgrim) on Feb 02, 2007 at 10:12 UTC
    Hi,

    search.cpan.org is your friend. Try HTML::LinkExtor and/or HTML::LinkExtractor modules. Both the modules allow you to specify a callback that can be used to filter the extracted links.
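
    For example, HTML::LinkExtor hands every link-bearing tag (including img src attributes) to the callback in document order, so a rough pairing can be done there. A minimal sketch; note the sequential pairing is an assumption of mine and does not verify that the img is actually nested inside the a:

    use strict;
    use warnings;
    use HTML::LinkExtor;

    my $html = q{<a href="some.html"><img src="an_image.jpg"></a>};

    my $last_href;
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        if ($tag eq 'a' and defined $attr{href}) {
            $last_href = $attr{href};       # remember the most recent anchor
        }
        elsif ($tag eq 'img' and defined $attr{src} and defined $last_href) {
            print "$last_href => $attr{src}\n";
            undef $last_href;
        }
    });
    $parser->parse($html);
    $parser->eof;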

    Additionally, this (or a very similar) question has already been asked on PerlMonks. Please use the search feature to find some of the archived threads and responses.

    Regards,
    Mahesh

Re: Extracting full links from HTML
by mirod (Canon) on Feb 02, 2007 at 10:53 UTC

    If you're a tad familiar with XPath (or if you want to become familiar with it), you can try HTML::TreeBuilder::XPath, which adds XPath support to HTML::TreeBuilder.

    The code looks like this:

    use warnings;
    use strict;

    use HTML::TreeBuilder::XPath;

    my $str = q{<a href ='some\\where'><img src='apple.gif'></img></a>
                <a href ='some\\where'>not an img</a>
                <a href ='some\\where'><img src='apple.gif'></img> and text</a>
               };

    my $root = HTML::TreeBuilder::XPath->new_from_content($str);

    foreach my $tag ($root->findnodes('//a[./img]')) {
        print "link: ", $tag->as_HTML;
    }

    If you want to capture only the links whose content is just an image, you can replace the condition by //a[./img and string()=""] (that doesn't exactly guarantee that there's nothing else in the link, but getting the query 100% right is left as an exercise for the reader).
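
    In code, that stricter query is just a different path expression (a sketch reusing the $root from the snippet above; whitespace inside the anchor will defeat the string()="" test):

    foreach my $tag ($root->findnodes('//a[./img and string()=""]')) {
        print "img-only link: ", $tag->as_HTML;
    }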

    Of course if all you're interested in is the value of the src attribute, you can get it directly:

    foreach my $url ($root->findnodes('//a/img/@src')) {
        print "link: ", $url->getValue, "\n";
    }

    Note that in that case the attributes are returned as HTML::TreeBuilder::XPath::Attribute objects, hence you need to use getValue to get the value. Hummm.... I wonder if that's in the docs; if not I'll add it.

Re: Extracting full links from HTML
by Scott7477 (Chaplain) on Feb 02, 2007 at 18:17 UTC
    Here is code that takes the URL of an HTML page from the command line and generates links to each image found in that page. I just took wfsp's code and swapped out his hardcoded links. Update: Also changed the code so that the full URL of each image prints. I figure that would be handy for downloading any or all of the images if so desired.
    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::TokeParser::Simple;

    # usage: imglinker http://www.example.com
    my $url     = shift;
    my $content = get($url);

    my $p = HTML::TokeParser::Simple->new(\$content);

    my $in_anchor;
    while (my $t = $p->get_token){
        if ($t->is_start_tag('a')){
            $in_anchor++;
            next;
        }
        if ($t->is_start_tag('img') and $in_anchor){
            my $src = $t->get_attr('src');
            print $url . "/" . "$src\n";
            $in_anchor = 0;
        }
    }
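
    A note on the URL joining: plain concatenation goes wrong when src is already absolute or when the page lives below a path. The URI module (pulled in by LWP) resolves relative links properly; a sketch with made-up example values:

    use strict;
    use warnings;
    use URI;

    my $url = 'http://www.example.com/gallery/index.html';  # hypothetical page URL
    my $src = 'images/photo.jpg';                           # hypothetical img src

    # new_abs resolves a (possibly relative) link against the page URL
    print URI->new_abs($src, $url), "\n";
    # prints: http://www.example.com/gallery/images/photo.jpg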
Re: Extracting full links from HTML
by OfficeLinebacker (Chaplain) on Feb 03, 2007 at 19:11 UTC
    I'm with Grandpa on this one. I've used HTML::TreeBuilder with good results. The HTML::Elements will have all of the attribs. You can look for all elements in a tree that have a 'src' attrib, all links, whatever.
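
    For instance, look_down also takes a code ref, so "every element with a src attrib" is easy to express (an untested sketch):

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $root = HTML::TreeBuilder->new_from_content(
        q{<a href="x.html"><img src="a.gif"></a><script src="b.js"></script>}
    );

    # the code ref is called with each element; match anything carrying src
    for my $e ($root->look_down(sub { defined $_[0]->attr('src') })) {
        print $e->tag, ": ", $e->attr('src'), "\n";
    }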

    I like computer programming because it's like Legos for the mind.
      TreeBuilder is actually what I ended up using, but it really felt and looked ugly. I'm really shocked Mechanize doesn't already do this in its link extraction. I mean, a WWW::Mechanize::Image class exists. There's no reason to "textify" anything when you could just include a reference to a WWW::Mechanize::Image object. Thanks though everybody :)
        I agree with most of you. But what do you do when the image is in an input tag, like: <input name="image1" type="image" src="images/go.gif" align="middle" width="25" height="19" border="0">? How do you extract such images?
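
        The same tree approach handles that case too; a minimal sketch (untested) matching input tags of type image:

        use strict;
        use warnings;
        use HTML::TreeBuilder;

        my $str  = q{<input name="image1" type="image" src="images/go.gif">};
        my $root = HTML::TreeBuilder->new_from_content($str);

        # look_down can match on attribute values as well as the tag name
        for my $input ($root->look_down('_tag', 'input', 'type', 'image')) {
            print $input->attr('src'), "\n";
        }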