You are barking up the wrong tree.

You could make that approach work correctly, but taking data that has already been parsed (by HTML::TreeBuilder in this case), dumping it to an unparsed format (via as_HTML), and reparsing it (via regexes), is a red flag.

Even if it was not a bad idea in general, as_HTML does not always output the one-tag-per-line format that your code would need.

Your task is complicated by the UL&LI tags not occurring within the SPAN tag. By the time you are processing a LI tag, the author in the previous SPAN tag cannot be directly accessed, since the SPAN is before the LI, but not a parent of LI.

Your impulse to iterate over the tags is good. The "my $author;" line would have to be outside the while() loop, though.

find_by_tag_name() accepts multiple tag names, and so will do what you need.

Working, tested code:

#!/usr/bin/env perl use strict; use warnings; use HTML::TreeBuilder; use Data::Dumper; $Data::Dumper::Sortkeys = 1; my $tree = HTML::TreeBuilder->new; $tree->parse( <<'END_OF_HTML' ); <span> Author_name </span> __filler__ <ul> <li> book 1 by Author_name </li> <li> book 2 by Author_name </li> </ul> <span> New_Author </span> __filler__ <ul> <li> book 1 by new </li> </ul> END_OF_HTML $tree->eof; # Uncomment to show that as_HTML is a bad fit for this task. # open my $fh , '<', \( $tree->as_HTML('', ' ') ) or die; # print $_ while <$fh>; # exit; my @tags = $tree->find_by_tag_name( qw( span li ) ); my $current_author; my %book_author; my %author_books_HoA; for my $t (@tags) { my $tag_name = $t->tag; if ( $tag_name eq 'span' ) { $current_author = $t->as_trimmed_text; } elsif ( $tag_name eq 'li' ) { next unless $t->parent->tag eq 'ul'; my $book_title = $t->as_trimmed_text; warn if exists $book_author{$book_title}; $book_author{$book_title} = $current_author; push @{ $author_books_HoA{$current_author} }, $book_title; } else { die "Unexpected tag $tag_name" } } print Dumper \%book_author, \%author_books_HoA;

Output:

$VAR1 = { 'book 1 by Author_name' => 'Author_name', 'book 1 by new' => 'New_Author', 'book 2 by Author_name' => 'Author_name' }; $VAR2 = { 'Author_name' => [ 'book 1 by Author_name', 'book 2 by Author_name' ], 'New_Author' => [ 'book 1 by new' ] };

/em

In reply to Re: Possible to treat an HTML::TreeBuilder object as a filehandle? by Util
in thread Possible to treat an HTML::TreeBuilder object as a filehandle? by jms53

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.