Re: Parsing HTML - once again

You may also be interested in HTML::TreeBuilder. Consider:

use warnings;
use strict;
use HTML::TreeBuilder;

my $html = <<'HTML';
<a href="URL1"> text1.0 <img src="SRC1"> text1.1 <br>
                <garbage2> text1.2
   <a href="URL2"> text2.0 <img src="SRC2"> text2.1 <br> <garbage2>
                   <garbage3> text2.2
      <a href="URL3"> text3.0 <img src="SRC3"> text3.1 <br> <garbage3>
            text3.2 <x1 a="a" b="b"> text3.3
      </a>
      text2.3 <haha>
   </a>
   text1.3 <oho>
   <a href="URL4"> text4.0 <img src="SRC4"> text4.1 <br> <garbage4>
                    text4.2 <x1 a="a" b="b"> text4.3
   </a>
   text1.4
</a>
HTML

my $tree = HTML::TreeBuilder->new_from_content ($html);

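# Visit every <a> element in the tree, nested ones included
for my $elt ($tree->look_down ('_tag', 'a')) {
    print "A " . $elt->attr ('href') . "\n\tTEXT: '";

    my @text_segs;

    for my $child ($elt->content_list ()) {
        # Skip child elements such as <img> and <br>
        next if ref $child and $child->tag () ne 'a';
        # Stop at the first nested <a>; whatever follows it is left out
        last if ref $child;
        # Anything that is not a reference is a plain text segment
        push @text_segs, $child;
    }

    print "$_ " for @text_segs;
    print "\n";
}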

Prints:

A URL1
    TEXT: ' text1.0   text1.1    text1.2  
A URL2
    TEXT: ' text2.0   text2.1     text2.2  
A URL3
    TEXT: ' text3.0   text3.1    text3.2  text3.3  
A URL4
    TEXT: ' text4.0   text4.1    text4.2  text4.3
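
If the stray whitespace bothers you, something along these lines might tidy it up. This is only a sketch: direct_text is a name I made up for illustration (it is not part of HTML::TreeBuilder), and the one-line sample document is an assumption rather than the original data. It trims each text segment, drops empty ones, and closes the quote.

```perl
use warnings;
use strict;
use HTML::TreeBuilder;

# Hypothetical helper: the text that sits directly inside $elt,
# up to (but not including) the first nested <a>.
sub direct_text {
    my ($elt) = @_;
    my @segs;

    for my $child ($elt->content_list ()) {
        if (ref $child) {
            last if $child->tag () eq 'a';    # stop at a nested link
            next;                             # skip <img>, <br> and friends
        }

        # Trim leading and trailing whitespace from the text segment
        (my $seg = $child) =~ s/^\s+|\s+$//g;
        push @segs, $seg if length $seg;
    }

    return join ' ', @segs;
}

# Assumed sample input, much simpler than the document above
my $tree = HTML::TreeBuilder->new_from_content (
    '<a href="URL1"> text1.0 <img src="SRC1"> text1.1 <br> text1.2 </a>');

for my $elt ($tree->look_down (_tag => 'a')) {
    printf "A %s\n\tTEXT: '%s'\n", $elt->attr ('href'), direct_text ($elt);
}

$tree->delete ();    # release the tree when done
```

Pulling the loop out into a sub keeps the printing separate from the tree walking, which should make it easier to swap in a different extraction rule later.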

DWIM is Perl's answer to Gödel


Replies are listed 'Best First'.
Re^2: Parsing HTML - once again by Krambambuli (Curate) on May 10, 2007 at 07:03 UTC
Thank you for this one; I forgot to mention HTML::TreeBuilder, but I had it in mind too. I hope to find the time to add the other variations as well; once that's done, I'll want to unify the style across the different approaches and then comment a bit on the pros and cons of each. I'll probably avoid benchmarking (for the reasons given in No More Meaningless Benchmarks!), but I'm not sure yet. In the end, I hope to have collected a few code samples that make good reading for anyone stepping into the HTML parsing task. I'm still looking for a good title ("Parsing HTML"?) and a good place to put the whole thing. I'm tempted to make it a set of linked nodes in 'Code Catacombs', but I'm unsure yet. Thanks again.
