Markismus has asked for the wisdom of the Perl Monks concerning the following question:

I am extracting keyword and definition pairs from a large HTML document with HTML::TreeBuilder and HTML::Element. After parsing the HTML string and dumping nodes, I find that the <br/> tags are missing.

How can I prevent that?

use HTML::TreeBuilder 5 -weak;
my $tree = HTML::TreeBuilder->new;
# HTML given is read from a file in UTF-8 format. Using parse_file returns garbled characters.
my $html = shift;
# The next 2 lines are added to see whether it would impact losing <br/> tags
$tree->no_space_compacting(1);
$tree->ignore_unknown(0);
$tree->parse( $html );
$tree->eof();
...
...
$Definition = $DefinitionNode->as_HTML('<>&');

Part of the HTML-string input:

<i>adv.</i><br/><b>1</b> <span lang="pt">maldosamente</span><br/><b>2</b> <span lang="pt">maliciosamente</span><br/><b>3</b> <span lang="pt">intencionalmente</span>

And the resulting output:

<i>adv.</i><b>1</b> <span lang="pt">maldosamente</span><b>2</b> <span lang="pt">maliciosamente</span><b>3</b> <span lang="pt">intencionalmente</span>

Re: Losing <br/> tags after parsing with HTML::TreeBuilder
by choroba (Cardinal) on Jun 05, 2025 at 09:04 UTC
    What version of Perl and HTML::TreeBuilder are you using?

    I can't reproduce the reported behaviour. Here's a standalone example:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use HTML::TreeBuilder 5 -weak;
    my $tree = HTML::TreeBuilder->new;
    my $content = join "\n", <DATA>;
    $tree->no_space_compacting(1);
    $tree->ignore_unknown(0);
    $tree->parse($content);
    $tree->eof();
    print $tree->as_HTML('<>&');
    __DATA__
    <html>
    <head>
    </head>
    <body>
    <i>adv.</i><br/><b>1</b> <span lang="pt">maldosamente</span><br/><b>2</b> <span lang="pt">maliciosamente</span><br/><b>3</b> <span lang="pt">intencionalmente</span>
    </body></html>

    The output on my machine:

    <html><head></head><body><i>adv.</i><br /><b>1</b> <span lang="pt">maldosamente</span><br /><b>2</b> <span lang="pt">maliciosamente</span><br /><b>3</b> <span lang="pt">intencionalmente</span></body></html>

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

      I also get the <br /> tags with Perl 5.40.1 and HTML::TreeBuilder 5.07 on Mac OS.


      The way forward always starts with a minimal test.

      perl 5, version 40, subversion 2 (v5.40.2) built for x86_64-linux-thread-multi

      HTML::Tree 5.07

      I can't reproduce it with the strings only, either. Very frustrating.

      #!/bin/perl
      use strict;
      use utf8;
      use open IO => ':utf8';
      use open ':std', ':utf8';
      use feature 'unicode_strings'; # You get funky results with the sub convertNumberedSequencesToChar without this.
      use HTML::TreeBuilder 5 -weak;
      my $tree = HTML::TreeBuilder->new;
      # HTML given is read from a file in UTF-8 format. Using parse_file returns garbled characters.
      my $html = '<i>adv.</i><br/><b>1</b> <span lang="pt">maldosamente</span><br/><b>2</b> <span lang="pt">maliciosamente</span><br/><b>3</b> <span lang="pt">intencionalmente</span>';
      # The next 2 lines are added to see whether it would impact losing <br/> tags
      $tree->no_space_compacting(1);
      $tree->ignore_unknown(0);
      $tree->parse( $html );
      $tree->eof();
      $tree->dump();
      print( $tree->as_HTML('&<>') );

      results in:

      <html><head></head><body><i>adv.</i><br /><b>1</b> <span lang="pt">maldosamente</span><br /><b>2</b> <span lang="pt">maliciosamente</span><br /><b>3</b> <span lang="pt">intencionalmente</span></body></html>

      Parsing the whole file also shows the <br/> tags. However, after adding attributes, detaching and undefining the attributes, the tags are gone.

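      # attr() is not an lvalue: pass the new value as the second argument.
      $Node->attr( "size_of_contents", scalar( $Node->content_list ) );
      $Node->detach();
      # Passing undef as the value deletes the attribute.
      $Node->attr( "size_of_contents", undef );
      $Node->dump();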

      No <br/> tags found anymore. I am sure I am missing something.

        The size shouldn't matter, but the contents might. Try removing parts of the file and check whether the behaviour persists; aim for the smallest possible file that still shows the symptoms.
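
One way to narrow this down is to count the <br> elements under the node after each manipulation step, so the step that loses them becomes visible. The following is only a minimal sketch, not the original poster's code: the helper name count_br, the wrapping <div id="entry">, and the sample markup are assumptions made for illustration. It uses look_down, attr, content_list and detach from HTML::Element.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;

# Hypothetical helper: number of <br> elements at or below a node.
sub count_br {
    my ($node) = @_;
    my @br = $node->look_down( _tag => 'br' );
    return scalar @br;
}

# Illustrative input, modelled on the sample in the question.
my $html = '<div id="entry"><i>adv.</i><br/><b>1</b> '
         . '<span lang="pt">maldosamente</span><br/><b>2</b> '
         . '<span lang="pt">maliciosamente</span></div>';

my $tree = HTML::TreeBuilder->new;
$tree->no_space_compacting(1);
$tree->parse($html);
$tree->eof();

# Pick the node to manipulate (the id is an assumption of this sketch).
my $node = $tree->look_down( id => 'entry' );

print "after parse:       ", count_br($node), "\n";

# Set an attribute: the value is the second argument to attr().
$node->attr( 'size_of_contents', scalar( $node->content_list ) );
print "after attr set:    ", count_br($node), "\n";

$node->detach();
print "after detach:      ", count_br($node), "\n";

# Passing undef deletes the attribute again.
$node->attr( 'size_of_contents', undef );
print "after attr delete: ", count_br($node), "\n";
```

If the count stays the same through all four steps on a small sample but drops on the real file, the cause is more likely in the surrounding code or in the specific markup than in attr() or detach() themselves.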
