in reply to Parsing HTML - once again

You may also be interested in HTML::TreeBuilder. Consider:

use warnings;
use strict;
use HTML::TreeBuilder;

my $html = <<'HTML';
<a href="URL1"> text1.0 <img src="SRC1"> text1.1 <br> <garbage2> text1.2
    <a href="URL2"> text2.0 <img src="SRC2"> text2.1 <br> <garbage2> <garbage3> text2.2
        <a href="URL3"> text3.0 <img src="SRC3"> text3.1 <br> <garbage3> text3.2 <x1 a="a" b="b"> text3.3 </a>
    text2.3 <haha> </a>
    text1.3 <oho>
    <a href="URL4"> text4.0 <img src="SRC4"> text4.1 <br> <garbage4> text4.2 <x1 a="a" b="b"> text4.3 </a>
text1.4 </a>
HTML

my $tree = HTML::TreeBuilder->new_from_content ($html);

# Visit every <a> element in the tree, nested ones included
for my $elt ($tree->look_down ('_tag', 'a')) {
    print "A " . $elt->attr ('href') . "\n\tTEXT: '";
    my @text_segs;

    for my $child ($elt->content_list ()) {
        # Skip child elements that are not <a> (img, br, ...)
        next if ref $child and $child->{_tag} ne 'a';
        # Stop collecting at the first nested <a>
        last if ref $child;
        # Anything left is a plain text segment
        push @text_segs, $child;
    }

    print "$_ " for @text_segs;
    print "\n";
}

Prints:

A URL1
	TEXT: ' text1.0 text1.1 text1.2
A URL2
	TEXT: ' text2.0 text2.1 text2.2
A URL3
	TEXT: ' text3.0 text3.1 text3.2 text3.3
A URL4
	TEXT: ' text4.0 text4.1 text4.2 text4.3

DWIM is Perl's answer to Gödel

Re^2: Parsing HTML - once again
by Krambambuli (Curate) on May 10, 2007 at 07:03 UTC
    Thank you for this one; I forgot to mention HTML::TreeBuilder, but I had it in mind too.

    I hope to find the time to add the other variations too; once that's done, I'd like to make the style uniform across the different approaches and then comment briefly on the pros and cons of each.

    I'll probably avoid benchmarking (see the reasoning in No More Meaningless Benchmarks!), but I'm not sure yet. In the end, I hope to have collected a few code samples that make good reading for anyone taking on an HTML parsing task.

    I'm still looking for a good title ("Parsing HTML"?) and a good place to put the whole thing in the end. I'm tempted to make it a set of linked nodes in 'Code Catacombs', but I'm not sure yet.

    Thanks again.