
You are barking up the wrong tree.

You could make that approach work correctly, but taking data that has already been parsed (by HTML::TreeBuilder in this case), dumping it to an unparsed format (via as_HTML), and reparsing it (via regexes), is a red flag.

Even if it were not a bad idea in general, as_HTML does not always output the one-tag-per-line format that your code would need.

Your task is complicated by the UL and LI tags not occurring within the SPAN tag. By the time you are processing an LI tag, the author in the previous SPAN tag cannot be directly accessed, since the SPAN comes before the LI but is not a parent of the LI.
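To see the problem concretely, here is a minimal sketch (the one-line HTML fragment is made up for illustration, not your real page). It lists the ancestors of an LI and asks look_up() for a SPAN among them; look_up() only checks the element itself and its ancestors, so a SPAN that merely precedes the list is not found.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical, cut-down fragment with the same shape as the real data.
my $tree = HTML::TreeBuilder->new;
$tree->parse('<span>Author_name</span><ul><li>book 1</li></ul>');
$tree->eof;

# In scalar context, find_by_tag_name() returns the first match.
my $li = $tree->find_by_tag_name('li');

# Tag names of every ancestor, from the LI's parent up to the root.
my @ancestors = map { $_->tag } $li->lineage;

# look_up() checks the LI itself and its ancestors, never its siblings.
my $span_above = $li->look_up( _tag => 'span' );

print "Ancestors of LI: @ancestors\n";
print 'SPAN among them: ', ( defined $span_above ? 'yes' : 'no' ), "\n";
```

This is why the author has to be remembered while walking the tags in order, rather than looked up from the LI.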

Your impulse to iterate over the tags is good. The "my $author;" line would have to be outside the while() loop, though.
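A small sketch of why the declaration has to move, using plain core Perl. The @items list is a hypothetical stand-in for your tags, and I have used for() rather than your while() only to keep it short; the scoping behaves the same way.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical stand-in for the tags, as [ tag_name, text ] pairs.
my @items = (
    [ span => 'Author_name' ],
    [ li   => 'book 1' ],
    [ li   => 'book 2' ],
);

# Declared inside the loop: a fresh, undefined $author on every pass,
# so the value set by the SPAN is gone by the time an LI is seen.
my @inside;
for my $item (@items) {
    my $author;
    my ( $tag, $text ) = @$item;
    $author = $text if $tag eq 'span';
    push @inside, $author // 'UNKNOWN' if $tag eq 'li';
}

# Declared outside the loop: the value survives from the SPAN to the LIs.
my @outside;
my $author;
for my $item (@items) {
    my ( $tag, $text ) = @$item;
    $author = $text if $tag eq 'span';
    push @outside, $author // 'UNKNOWN' if $tag eq 'li';
}

print "inside:  @inside\n";     # inside:  UNKNOWN UNKNOWN
print "outside: @outside\n";    # outside: Author_name Author_name
```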

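find_by_tag_name() accepts multiple tag names and, in list context, returns every matching element in the order the tree is traversed (document order), so each SPAN arrives before the LI tags that follow it. That will do what you need.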

Working, tested code:

#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder;
use Data::Dumper; $Data::Dumper::Sortkeys = 1;

my $tree = HTML::TreeBuilder->new;
$tree->parse( <<'END_OF_HTML' );
    <span> Author_name </span>
    __filler__
    <ul>
      <li> book 1 by Author_name </li>
      <li> book 2 by Author_name </li>
    </ul>
    <span> New_Author </span>
    __filler__
    <ul>
      <li> book 1 by new </li>
    </ul>
END_OF_HTML
$tree->eof;

# Uncomment to show that as_HTML is a bad fit for this task.
# open my $fh , '<', \( $tree->as_HTML('', '  ') ) or die;
# print $_ while <$fh>;
# exit;

my @tags = $tree->find_by_tag_name( qw( span li ) );

my $current_author;
my %book_author;
my %author_books_HoA;
for my $t (@tags) {
    my $tag_name = $t->tag;
    if ( $tag_name eq 'span' ) {
        $current_author = $t->as_trimmed_text;
    }
    elsif ( $tag_name eq 'li' ) {
        next unless $t->parent->tag eq 'ul';

        my $book_title = $t->as_trimmed_text;

        warn "Duplicate book title: $book_title" if exists $book_author{$book_title};
        $book_author{$book_title} = $current_author;

        push @{ $author_books_HoA{$current_author} }, $book_title;
    }
    else {
        die "Unexpected tag $tag_name";
    }
}

print Dumper \%book_author, \%author_books_HoA;

Output:

$VAR1 = {
          'book 1 by Author_name' => 'Author_name',
          'book 1 by new' => 'New_Author',
          'book 2 by Author_name' => 'Author_name'
        };
$VAR2 = {
          'Author_name' => [
                             'book 1 by Author_name',
                             'book 2 by Author_name'
                           ],
          'New_Author' => [
                            'book 1 by new'
                          ]
        };


In reply to Re: Possible to treat an HTML::TreeBuilder object as a filehandle? by Util
in thread Possible to treat an HTML::TreeBuilder object as a filehandle? by jms53
