in reply to Parsing HTML - once again
You may also be interested in HTML::TreeBuilder. Consider:
    use warnings;
    use strict;
    use HTML::TreeBuilder;

    my $html = <<'HTML';
    <a href="URL1">
    text1.0 <img src="SRC1"> text1.1 <br> <garbage2> text1.2
    <a href="URL2">
    text2.0 <img src="SRC2"> text2.1 <br> <garbage2> <garbage3> text2.2
    <a href="URL3">
    text3.0 <img src="SRC3"> text3.1 <br> <garbage3> text3.2 <x1 a="a" b="b"> text3.3
    </a>
    text2.3 <haha>
    </a>
    text1.3 <oho>
    <a href="URL4">
    text4.0 <img src="SRC4"> text4.1 <br> <garbage4> text4.2 <x1 a="a" b="b"> text4.3
    </a>
    text1.4
    </a>
    HTML

    my $tree = HTML::TreeBuilder->new_from_content ($html);

    # Visit every <a> element found in the parsed tree.
    for my $elt ($tree->look_down ('_tag', 'a')) {
        print "A " . $elt->attr ('href') . "\n\tTEXT: '";
        my @text_segs;

        for my $child ($elt->content_list ()) {
            # Skip child elements other than <a> (img, br, ...).
            next if ref $child and $child->tag ne 'a';
            # Stop collecting at the first nested <a>.
            last if ref $child;
            # Plain strings are text segments.
            push @text_segs, $child;
        }

        print "$_ " for @text_segs;
        print "\n";
    }
Prints:
    A URL1
        TEXT: ' text1.0 text1.1 text1.2
    A URL2
        TEXT: ' text2.0 text2.1 text2.2
    A URL3
        TEXT: ' text3.0 text3.1 text3.2 text3.3
    A URL4
        TEXT: ' text4.0 text4.1 text4.2 text4.3
Re^2: Parsing HTML - once again
by Krambambuli (Curate) on May 10, 2007 at 07:03 UTC