Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Is there an easy way to extract the alt text of images that are used as links in an HTML document? Thanks for your help! S.

Re: Extracting ALT text from image links
by rob_au (Abbot) on Jun 25, 2002 at 12:47 UTC
    How about this, using the ever-venerable HTML::TokeParser? Also have a look at the HTML::TokeParser tutorial on this site.

    use HTML::TokeParser;
    use LWP::Simple;

    my $content = get('http://www.yoursite.com');
    my (@alt, $link);
    my $parser = HTML::TokeParser->new(\$content) || die $!;

    while (my $token = $parser->get_token) {
        my $type = shift @{$token};
        if ($type eq 'E') {
            # End tag: leaving a link
            my ($tag) = @{$token};
            $link = 0 if $tag eq 'a';
        }
        elsif ($type eq 'S') {
            # Start tag: note when we enter a link, then look for images
            my ($tag, $attr, $attrseq, $text) = @{$token};
            $link = 1 if $tag eq 'a';
            next unless $tag eq 'img';
            next unless defined $attr->{'alt'} and length $attr->{'alt'};
            # Record src => alt only for images inside a link
            push @alt, { $attr->{'src'} => $attr->{'alt'} } if $link;
        }
    }

     

Re: Extracting ALT text from image links
by broquaint (Abbot) on Jun 25, 2002 at 13:23 UTC
    There's also the oft-neglected HTML::PullParser to come to your aid. Here's an incomplete example of how you might use it:

    use strict;
    use HTML::PullParser;

    my $p = HTML::PullParser->new(
        file  => shift @ARGV,
        start => 'tagname, @attr'
    );

    # Only start tags are reported; prints the alt text of any tag that has one
    while (my $t = $p->get_token()) {
        my ($tagname, %attr) = @$t;
        print "alt text is $attr{alt}", $/
            if exists $attr{alt};
    }
    Remember to check out the docs for more info on the module (specifically the start and end events will be needed to get img tags from *within* a tags).
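
To make that last point concrete, here is a minimal sketch of how the start and end events might be combined to keep only images inside links. The sample markup, the "S"/"E" literals used to tag each token, and the names $in_link and @alts are illustrative assumptions, not part of the example above.

```perl
use strict;
use warnings;
use HTML::PullParser;

# Sample markup (an assumption for illustration): one linked image, one bare image
my $html = '<a href="foo.html"><img src="foo.gif" alt="blah"></a> '
         . '<img src="bar.gif" alt="nenene">';

my $p = HTML::PullParser->new(
    doc   => \$html,
    start => '"S", tagname, attr',   # start tags: type, name, attribute hash ref
    end   => '"E", tagname',         # end tags: type, name
);

my $in_link = 0;
my @alts;
while (my $t = $p->get_token) {
    my ($type, $tagname, $attr) = @$t;
    if ($tagname eq 'a') {
        # Track whether we are currently between <a> and </a>
        $in_link = $type eq 'S' ? 1 : 0;
    }
    elsif ($type eq 'S' and $tagname eq 'img' and $in_link
           and defined $attr->{alt} and length $attr->{alt}) {
        push @alts, $attr->{alt};
    }
}

print "alt text is $_\n" for @alts;
```

Note that this simple flag treats any a element as a link, including named anchors without an href, and does not handle nested a elements.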
    HTH

    _________
    broquaint

Re: Extracting ALT text from image links
by gav^ (Curate) on Jun 25, 2002 at 14:23 UTC
    Just for completeness, here is an example using HTML::TreeBuilder:

    use HTML::TreeBuilder;

    # $html is assumed to already hold the document source
    my $tree = HTML::TreeBuilder->new_from_content($html);

    foreach my $img ($tree->look_down('_tag', 'img')) {
        if ($img->attr('alt')) {
            print "Alt tag found: ", $img->attr('alt'), "\n";
        }
    }

    # Free the tree explicitly when done
    $tree->delete;
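
The loop above reports every image that has alt text. If only images inside links are wanted, as the original question asks, one possible variation is to check each image for an enclosing a element with look_up. This is a sketch under assumptions: the sample markup and the @alts array are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup (an assumption for illustration)
my $html = '<a href="foo.html"><img src="foo.gif" alt="blah"></a> '
         . '<img src="bar.gif" alt="nenene">';

my $tree = HTML::TreeBuilder->new_from_content($html);

my @alts;
for my $img ($tree->look_down(_tag => 'img')) {
    my $alt = $img->attr('alt');
    next unless defined $alt and length $alt;
    # look_up searches the element and its ancestors for a matching tag
    next unless $img->look_up(_tag => 'a');
    push @alts, $alt;
}
$tree->delete;

print "Alt tag found: $_\n" for @alts;
```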

    gav^

Re: Extracting ALT text from image links
by Matts (Deacon) on Jun 25, 2002 at 15:13 UTC
    Ooh, lots of different solutions. Here's one using XML::LibXML:

    #!/usr/bin/perl -w
    use strict;
    use XML::LibXML;

    my $file = $ARGV[0] || die "Usage: $0 [uri|filename]\n";
    my $doc = XML::LibXML->new->parse_html_file($file);

    print "Alt tags in $file:\n";
    foreach my $alt ($doc->findnodes('//img/@alt')) {
        print "Alt tag: ", $alt->nodeValue, "\n";
    }
    print "Done\n";
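
The XPath expression above matches the alt attribute of every image. To limit the result to images inside links, the expression could be narrowed to descendants of a elements that carry an href. The following is a sketch only: the sample markup and the use of load_html on a string (rather than a file) are assumptions made to keep the example self-contained.

```perl
use strict;
use warnings;
use XML::LibXML;

# Sample markup (an assumption for illustration)
my $html = '<html><body>'
         . '<a href="foo.html"><img src="foo.gif" alt="blah"></a>'
         . '<img src="bar.gif" alt="nenene">'
         . '</body></html>';

my $doc = XML::LibXML->load_html(
    string          => $html,
    recover         => 1,   # tolerate sloppy real-world HTML
    suppress_errors => 1,
);

# Only alt attributes of img elements nested inside <a href="...">
my @alts = map { $_->nodeValue } $doc->findnodes('//a[@href]//img/@alt');

print "Alt tag: $_\n" for @alts;
```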

Re: Extracting ALT text from image links
by Jenda (Abbot) on Jun 25, 2002 at 20:05 UTC

    For completeness' sake ... this time using HTML::Parser:

    use HTML::Parser;

    $p = HTML::Parser->new(
        api_version     => 3,
        start_h         => [\&start, "tagname, attr"],
        end_h           => [\&end, "tagname"],
        marked_sections => 1,
    );

    {
        # Shared state for the two handlers
        my $in_link = 0;

        sub start {
            my ($tagname, $attr) = @_;
            if ($tagname eq 'a') {
                $in_link = 1;
            }
            elsif ($in_link and $tagname eq 'img' and exists $attr->{alt}) {
                print "IMG: $attr->{src} = $attr->{alt}\n";
            }
        }

        sub end {
            $in_link = 0 if ($_[0] eq 'a');
        }
    }

    $p->parse('sadf dsfg<a href="foo.html"><iMg src="foo.gif" alt="blah"></a> <img src="bar.gif" alt="nenene"> sdf');
    $p->eof();

      Jenda