wojtyk has asked for the wisdom of the Perl Monks concerning the following question:

Does anyone know the best way to extract all links of the following format from an HTML page?

<a href = ...><img src= ...></img></a>

Essentially, I want all links that have an image as the "text" of an href. The crucial thing, however, is that I need to know the src location of that image. All the CPAN modules I've tried seem to toss that data away when they "textify" the link (the important img data inside the href is replaced with a useless "[IMG]" tag).

I've tinkered/experimented with everything from WWW::Mechanize to HTML::LinkExtor to HTML::Tree to HTML::Parser to at least a dozen other things. I can't get anything to work right that doesn't textify first.

I really didn't want to homegrow a regex, but I'm running out of options.

Does anyone know the best way to do this?

Re: Extracting full links from HTML
by GrandFather (Saint) on Feb 02, 2007 at 10:26 UTC
    use warnings;
    use strict;

    use HTML::TreeBuilder;

    my $str = "<a href ='some\\where'><img src='apple.gif'></img></a>";

    my $root = HTML::TreeBuilder->new_from_content($str);

    for ($root->look_down('_tag', 'a')) {
        next if !$_->look_down('_tag', 'img');
        print $_->as_HTML();
    }

    which prints:

    <a href="some\where"><img src="apple.gif"></a>

    should get you started. Take a look at HTML::TreeBuilder and HTML::Element.
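
    Since the OP wants the src value itself, a small extension of the above (an untested sketch using the same HTML::Element accessors) pulls the attributes out directly instead of printing the whole element:

    use warnings;
    use strict;

    use HTML::TreeBuilder;

    my $str  = "<a href ='some\\where'><img src='apple.gif'></img></a>";
    my $root = HTML::TreeBuilder->new_from_content($str);

    for my $anchor ($root->look_down('_tag', 'a')) {
        # skip anchors that don't contain an image
        my $img = $anchor->look_down('_tag', 'img') or next;
        # attr() returns an attribute's value, or undef if absent
        print $anchor->attr('href'), " => ", $img->attr('src'), "\n";
    }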


    DWIM is Perl's answer to Gödel
Re: Extracting full links from HTML
by wfsp (Abbot) on Feb 02, 2007 at 10:53 UTC
    I agree, you really don't want a regex. :-)

    Here's my go using HTML::TokeParser::Simple.

    #!C:/Perl/bin/perl.exe
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my $html = q{
    <a href="some.html"><img src="an_image.jpg"/></a>
    };

    my $p = HTML::TokeParser::Simple->new(\$html);

    my $in_anchor;
    while (my $t = $p->get_token){
        if ($t->is_start_tag('a')){
            $in_anchor++;
            next;
        }
        if ($t->is_start_tag('img') and $in_anchor){
            my $src = $t->get_attr('src');
            print "$src\n";
            $in_anchor = 0;
        }
    }
    output:
    an_image.jpg
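
    One caveat with the flag approach: $in_anchor is never cleared on </a>, so an image that merely follows an empty link would still be reported. A variant (an untested sketch, same HTML::TokeParser::Simple API) that watches the end tag and captures the href as well:

    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my $html = q{<a href="some.html"><img src="an_image.jpg"/></a>};

    my $p = HTML::TokeParser::Simple->new(\$html);

    my $href;
    while (my $t = $p->get_token){
        if ($t->is_start_tag('a')){
            $href = $t->get_attr('href');   # remember the enclosing link
            next;
        }
        if ($t->is_end_tag('a')){
            undef $href;                    # left the anchor without seeing an img
            next;
        }
        if ($t->is_start_tag('img') and defined $href){
            print "$href => ", $t->get_attr('src'), "\n";
            undef $href;
        }
    }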
Re: Extracting full links from HTML
by smahesh (Pilgrim) on Feb 02, 2007 at 10:12 UTC
    Hi,

    search.cpan.org is your friend. Try HTML::LinkExtor and/or HTML::LinkExtractor modules. Both the modules allow you to specify a callback that can be used to filter the extracted links.
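
    For example, HTML::LinkExtor hands every link-bearing tag (including img src attributes) to the callback in document order, so a rough pairing can be done there. A minimal sketch; note the sequential pairing is an assumption of mine and does not verify that the img is actually nested inside the a:

    use strict;
    use warnings;
    use HTML::LinkExtor;

    my $html = q{<a href="some.html"><img src="an_image.jpg"></a>};

    my $last_href;
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        if ($tag eq 'a' and defined $attr{href}) {
            $last_href = $attr{href};       # remember the most recent anchor
        }
        elsif ($tag eq 'img' and defined $attr{src} and defined $last_href) {
            print "$last_href => $attr{src}\n";
            undef $last_href;
        }
    });
    $parser->parse($html);
    $parser->eof;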

    Additionally, this (or a very similar) question has already been asked on PerlMonks. Please use the search feature to find some of the archived threads and responses.

    Regards,
    Mahesh

Re: Extracting full links from HTML
by mirod (Canon) on Feb 02, 2007 at 10:53 UTC

    If you're a tad familiar with XPath (or if you want to become familiar with it), you can try HTML::TreeBuilder::XPath, which adds XPath support to HTML::TreeBuilder.

    The code looks like this:

    use warnings;
    use strict;

    use HTML::TreeBuilder::XPath;

    my $str = q{<a href ='some\\where'><img src='apple.gif'></img></a>
                <a href ='some\\where'>not an img</a>
                <a href ='some\\where'><img src='apple.gif'></img> and text</a>
               };

    my $root = HTML::TreeBuilder::XPath->new_from_content($str);

    foreach my $tag ($root->findnodes('//a[./img]')) {
        print "link: ", $tag->as_HTML;
    }

    If you want to capture only the links whose content is just an image, you can replace the condition by //a[./img and string()=""] (that doesn't exactly guarantee that there's nothing else in the link, but getting the query 100% right is left as an exercise for the reader).
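
    In code, that stricter query is just a different path expression (a sketch reusing the $root from the snippet above; whitespace inside the anchor will defeat the string()="" test):

    foreach my $tag ($root->findnodes('//a[./img and string()=""]')) {
        print "img-only link: ", $tag->as_HTML;
    }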

    Of course if all you're interested in is the value of the src attribute, you can get it directly:

    foreach my $url ($root->findnodes('//a/img/@src')) {
        print "link: ", $url->getValue, "\n";
    }

    Note that in that case the attributes are returned as HTML::TreeBuilder::XPath::Attribute objects, hence you need to use getValue to get the value. Hummm.... I wonder if that's in the docs; if not I'll add it.

Re: Extracting full links from HTML
by Scott7477 (Chaplain) on Feb 02, 2007 at 18:17 UTC
    Here is code that takes the URL of an HTML page from the command line and generates links to each image found in that page. I just took wfsp's code and swapped out his hardcoded links. Update: Also changed the code so that the full URL of each image prints. I figure that would be handy for downloading any or all of the images if so desired.
    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::TokeParser::Simple;

    # usage: imglinker http://www.example.com
    my $url     = shift;
    my $content = get($url);

    my $p = HTML::TokeParser::Simple->new(\$content);

    my $in_anchor;
    while (my $t = $p->get_token){
        if ($t->is_start_tag('a')){
            $in_anchor++;
            next;
        }
        if ($t->is_start_tag('img') and $in_anchor){
            my $src = $t->get_attr('src');
            print $url . "/" . "$src\n";
            $in_anchor = 0;
        }
    }
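
    A note on the URL joining: plain concatenation goes wrong when src is already absolute or when the page lives below a path. The URI module (pulled in by LWP) resolves relative links properly; a sketch with made-up example values:

    use strict;
    use warnings;
    use URI;

    my $url = 'http://www.example.com/gallery/index.html';  # hypothetical page URL
    my $src = 'images/photo.jpg';                           # hypothetical img src

    # new_abs resolves a (possibly relative) link against the page URL
    print URI->new_abs($src, $url), "\n";
    # prints: http://www.example.com/gallery/images/photo.jpg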
Re: Extracting full links from HTML
by OfficeLinebacker (Chaplain) on Feb 03, 2007 at 19:11 UTC
    I'm with Grandpa on this one. I've used HTML::TreeBuilder with good results. The HTML::Elements will have all of the attribs. You can look for all elements in a tree that have a 'src' attrib, all links, whatever.
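
    For instance, look_down also takes a code ref, so "every element with a src attrib" is easy to express (an untested sketch):

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $root = HTML::TreeBuilder->new_from_content(
        q{<a href="x.html"><img src="a.gif"></a><script src="b.js"></script>}
    );

    # the code ref is called with each element; match anything carrying src
    for my $e ($root->look_down(sub { defined $_[0]->attr('src') })) {
        print $e->tag, ": ", $e->attr('src'), "\n";
    }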

    I like computer programming because it's like Legos for the mind.
      TreeBuilder is actually what I ended up using, but it really felt and looked ugly. I'm really shocked Mechanize doesn't already do this in its link extraction. I mean, a WWW::Mechanize::Image class exists. There's no reason to "textify" anything when you could just include a reference to a WWW::Mechanize::Image object. Thanks though everybody :)
        I agree with most of you. But what do you do when the image is in an input tag, like: <input name="image1" type="image" src="images/go.gif" align="middle" width="25" height="19" border="0">? How do you extract such images?
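
        The same tree approach handles that case too; a minimal sketch (untested) matching input tags of type image:

        use strict;
        use warnings;
        use HTML::TreeBuilder;

        my $str  = q{<input name="image1" type="image" src="images/go.gif">};
        my $root = HTML::TreeBuilder->new_from_content($str);

        # look_down can match on attribute values as well as the tag name
        for my $input ($root->look_down('_tag', 'input', 'type', 'image')) {
            print $input->attr('src'), "\n";
        }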