Deleting nodes with HTML::TreeBuilder::XPath

mldvx4 has asked for the wisdom of the Perl Monks concerning the following question:

I'd like to process an XHTML document and remove certain types of nodes if they are blank. I've started trying to do that with HTML::TreeBuilder::XPath and seem to find the nodes I want to remove. The following subroutine finds them just fine but appears to fail to delete the empty nodes it finds:

sub readthefile {
    my ($file)= (@_);
    print qq(File=$file\n);

    my $xhtml = HTML::TreeBuilder::XPath->new;
    $xhtml->implicit_tags(1);
    $xhtml->parse_file($file)
        or die("Could not parse '$file' : $!\n");

    for my $list ($xhtml->findnodes('/html/body//div/ul/li')) {
        if($list->is_empty) {
            print qq(DELETE\n);
            $list->delete(); # this line does not do what I thought it would :(
        }

    }

    print $xhtml->as_XML_indented;
    $xhtml->delete;
    return (1);
}

As it is, the routine just prints out "DELETE" the right number of times but then at the end still prints the unmodified original XHTML. How do I excise the chosen nodes from the final output if they are empty?


Replies are listed 'Best First'.
Re: Deleting nodes with HTML::TreeBuilder::XPath by tangent (Parson) on Jun 20, 2019 at 14:05 UTC
Are you sure you are selecting the right nodes? `'/html/body//div/ul/li'` will select list items, not the list itself. Given this HTML:

<div>
  <ul>
    <li>one</li>
    <li>two</li>
    <li>three</li>
  </ul>
  <!-- List with 2 empty List Items -->
  <ul>
    <li></li>
    <li>two</li>
    <li></li>
  </ul>
  <!-- Empty List -->
  <ul>
  </ul>
</div>

Run on list items, it does delete empty items:

sub delete_empty_list_item {
    my ($file) = @_;
    my $xhtml = HTML::TreeBuilder::XPath->new;
    $xhtml->implicit_tags(1);
    $xhtml->parse_file($file);
    for my $list_item ($xhtml->findnodes('/html/body//div/ul/li')) {
        if ($list_item->is_empty) {
            print qq(DELETE\n);
            $list_item->delete();
        }
    }
    print $xhtml->as_XML_indented;
    $xhtml->eof;
}

OUTPUT:
<div>
  <ul>
    <li>one</li>
    <li>two</li>
    <li>three</li>
  </ul>
  <ul>
    <li>two</li>
  </ul>
  <ul>
  </ul>
</div>

Run on list elements, it deletes the list itself:

sub delete_empty_list {
    my ($file) = @_;
    my $xhtml = HTML::TreeBuilder::XPath->new;
    $xhtml->implicit_tags(1);
    $xhtml->parse_file($file);
    for my $list ($xhtml->findnodes('/html/body//div/ul')) {
        if ($list->is_empty) {
            print qq(DELETE\n);
            $list->delete();
        }
    }
    print $xhtml->as_XML_indented;
    $xhtml->eof;
}

OUTPUT:
<div>
  <ul>
    <li>one</li>
    <li>two</li>
    <li>three</li>
  </ul>
  <ul>
    <li></li>
    <li>two</li>
    <li></li>
  </ul>
</div>

You could combine the two if you need to delete both empty items and empty lists.
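For what it's worth, combining the two passes could look something like the sketch below. It is only a sketch under a few assumptions: the name prune_empty_lists and the sample markup are made up for illustration, and it parses a string with parse_content rather than a file so that it can be run on its own. The order matters here: empty items are removed first, so a list that held nothing but empty items is itself empty by the time the second pass looks at it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper (not from the thread): removes empty <li> elements,
# then any <ul> that has no content left, and returns the result as XML.
sub prune_empty_lists {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->implicit_tags(1);
    $tree->parse_content($html);

    # Pass 1: delete list items that have no content.
    for my $item ($tree->findnodes('/html/body//div/ul/li')) {
        $item->delete if $item->is_empty;
    }

    # Pass 2: delete lists that are empty, including ones emptied by pass 1.
    for my $list ($tree->findnodes('/html/body//div/ul')) {
        $list->delete if $list->is_empty;
    }

    my $out = $tree->as_XML_indented;
    $tree->delete;    # free the tree once the output has been captured
    return $out;
}

# Made-up sample input: one list with only an empty item, one mixed list.
print prune_empty_lists(
    '<div><ul><li></li></ul><ul><li>two</li><li></li></ul></div>'
);
```

Each loop walks the list that findnodes returned before anything was deleted in that pass, so deleting inside the loop should not disturb the iteration.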
Re: Deleting nodes with HTML::TreeBuilder::XPath by marto (Cardinal) on Jun 20, 2019 at 12:00 UTC
If you are open to alternatives, this uses Mojo::DOM:

#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

my $html = '<ul><li>1</li><li>2</li><li></li><li>4</li><li></li><li>six</li>';
my $dom = Mojo::DOM->new( $html );

for ( $dom->find('li:empty')->each ){
    $_->remove();
}

print "$dom\n";

Update: output:

<ul><li>1</li><li>2</li><li>4</li><li>six</li></ul>
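If empty lists should go as well, the same Mojo::DOM approach can probably be extended with a second selector. The following is a rough sketch, not something posted in this thread: the name prune_empty_lists_mojo and the sample markup are hypothetical. One caveat to check against your own data: as far as I understand the CSS `:empty` selector, an element that contains only whitespace text does not count as empty, so a list written as `<ul> </ul>` would likely survive this version.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper (not from the thread): removes empty <li> elements,
# then any <ul> left without children, and returns the serialized markup.
sub prune_empty_lists_mojo {
    my ($html) = @_;
    my $dom = Mojo::DOM->new($html);

    # Pass 1: list items with no children at all.
    $dom->find('li:empty')->each(sub { my ($e) = @_; $e->remove });

    # Pass 2: lists that are empty, including ones emptied by pass 1.
    $dom->find('ul:empty')->each(sub { my ($e) = @_; $e->remove });

    return $dom->to_string;
}

# Made-up sample input: one list with only an empty item, one mixed list.
print prune_empty_lists_mojo(
    '<div><ul><li></li></ul><ul><li>two</li><li></li></ul></div>'
), "\n";
```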
Re: Deleting nodes with HTML::TreeBuilder::XPath by bliako (Abbot) on Jun 20, 2019 at 12:08 UTC
Changing `if($list->is_empty)` to `if(!$list->is_empty)` should do it.

Edit: Thanks to tangent's answer I realised I had misunderstood what you meant by "empty" (as in "empty content"), so please ignore what I posted above. Your script works for me: it deletes the li items that have empty content.