mldvx4 has asked for the wisdom of the Perl Monks concerning the following question:

I'd like to process an XHTML document and remove certain types of nodes if they are blank. I've started trying to do that with HTML::TreeBuilder::XPath and seem to find the nodes I want to remove. The following subroutine finds them just fine but appears to fail to delete the found empty nodes:

sub readthefile {
    my ($file)= (@_);
    print qq(File=$file\n);

    my $xhtml = HTML::TreeBuilder::XPath->new;
    $xhtml->implicit_tags(1);
    $xhtml->parse_file($file)
        or die("Could not parse '$file' : $!\n");

    for my $list ($xhtml->findnodes('/html/body//div/ul/li')) {
        if($list->is_empty) {
            print qq(DELETE\n);
            $list->delete();  # this line does not do what I thought it would :(
        }
    }

    print $xhtml->as_XML_indented;
    $xhtml->delete;
    return (1);
}

As it is, the routine just prints out "DELETE" the right number of times but then at the end still prints the unmodified original XHTML. How do I excise the chosen nodes from the final output if they are empty?

Re: Deleting nodes with HTML::TreeBuilder::XPath
by tangent (Parson) on Jun 20, 2019 at 14:05 UTC
    Are you sure you are selecting the right nodes?
    '/html/body//div/ul/li' will select list items, not the list itself.

    Given this HTML:

    <div>
      <ul>
        <li>one</li>
        <li>two</li>
        <li>three</li>
      </ul>
      <!-- List with 2 empty List Items -->
      <ul>
        <li></li>
        <li>two</li>
        <li></li>
      </ul>
      <!-- Empty List -->
      <ul>
      </ul>
    </div>

    Run on list items, it does delete empty items:

    sub delete_empty_list_item {
        my $xhtml = HTML::TreeBuilder->new;
        $xhtml->implicit_tags(1);
        $xhtml->parse_file($file);
        for my $list_item ($xhtml->findnodes('/html/body//div/ul/li')) {
            if ($list_item->is_empty) {
                print qq(DELETE\n);
                $list_item->delete();
            }
        }
        print $xhtml->as_XML_indented;
        $xhtml->eof;
    }

    OUTPUT:

    <div>
      <ul>
        <li>one</li>
        <li>two</li>
        <li>three</li>
      </ul>
      <ul>
        <li>two</li>
      </ul>
      <ul>
      </ul>
    </div>

    Run on list elements, it deletes the list itself:

    sub delete_empty_list {
        my $xhtml = HTML::TreeBuilder->new;
        $xhtml->implicit_tags(1);
        $xhtml->parse_file($file);
        for my $list ($xhtml->findnodes('/html/body//div/ul')) {
            if ($list->is_empty) {
                print qq(DELETE\n);
                $list->delete();
            }
        }
        print $xhtml->as_XML_indented;
        $xhtml->eof;
    }

    OUTPUT:

    <div>
      <ul>
        <li>one</li>
        <li>two</li>
        <li>three</li>
      </ul>
      <ul>
        <li></li>
        <li>two</li>
        <li></li>
      </ul>
    </div>

    You could combine the two if you need to delete both empty items and empty lists.
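
A minimal sketch of that combination, not tested against the original poster's file. The inline sample markup and the variable names ($html, $tree, $xml) are made up for illustration. It assumes that running the list-item pass first is what you want, so that a list left with no items is also removed by the second pass:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample input: one list with an empty item, one list
# holding only an empty item, and one list with no items at all.
my $html = '<html><body><div>'
         . '<ul><li>one</li><li></li></ul>'
         . '<ul><li></li></ul>'
         . '<ul></ul>'
         . '</div></body></html>';

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# Pass 1 removes empty list items, pass 2 removes lists that are
# empty (including lists that pass 1 has just emptied).
for my $xpath ('/html/body//div/ul/li', '/html/body//div/ul') {
    for my $node ($tree->findnodes($xpath)) {
        $node->delete if $node->is_empty;
    }
}

my $xml = $tree->as_XML_indented;
print $xml;
$tree->delete;
```

If you would rather keep a list that only became empty because its items were removed, swap the order of the two XPath expressions.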

     

Re: Deleting nodes with HTML::TreeBuilder::XPath
by marto (Cardinal) on Jun 20, 2019 at 12:00 UTC

    If you are open to alternatives this uses Mojo::DOM:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Mojo::DOM;

    my $html = '<ul><li>1</li><li>2</li><li></li><li>4</li><li></li><li>six</li>';
    my $dom  = Mojo::DOM->new( $html );

    for ( $dom->find('li:empty')->each ){
        $_->remove();
    }

    print "$dom\n";

    Update: output: <ul><li>1</li><li>2</li><li>4</li><li>six</li></ul>

Re: Deleting nodes with HTML::TreeBuilder::XPath
by bliako (Abbot) on Jun 20, 2019 at 12:08 UTC

    Changing if($list->is_empty) to if(!$list->is_empty) should do it.

    Edit: Thanks to tangent's answer, I realised I had misunderstood what you meant by "empty" (as in "empty content"). Please ignore what I posted. Your script works for me: it deletes li items with empty content.