in reply to Losing <br/> tags after parsing with HTML::TreeBuilder

What version of Perl and HTML::TreeBuilder are you using?

I can't reproduce the reported behaviour. Here's a standalone example:

#!/usr/bin/perl
use warnings;
use strict;

use HTML::TreeBuilder 5 -weak;

my $tree = HTML::TreeBuilder->new;
my $content = join "\n", <DATA>;
$tree->no_space_compacting(1);
$tree->ignore_unknown(0);
$tree->parse($content);
$tree->eof();
print $tree->as_HTML('<>&');

__DATA__
<html>
<head>
</head>
<body>
<i>adv.</i><br/><b>1</b> <span lang="pt">maldosamente</span><br/><b>2</b> <span lang="pt">maliciosamente</span><br/><b>3</b> <span lang="pt">intencionalmente</span>
</body></html>

The output on my machine:

<html><head></head><body><i>adv.</i><br /><b>1</b> <span lang="pt">maldosamente</span><br /><b>2</b> <span lang="pt">maliciosamente</span><br /><b>3</b> <span lang="pt">intencionalmente</span></body></html>

map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

Re^2: Losing <br/> tags after parsing with HTML::TreeBuilder
by 1nickt (Canon) on Jun 05, 2025 at 09:38 UTC

    I also get the <br /> tags with Perl 5.40.1 and HTML::TreeBuilder 5.07 on Mac OS.


    The way forward always starts with a minimal test.
Re^2: Losing <br/> tags after parsing with HTML::TreeBuilder
by Markismus (Acolyte) on Jun 05, 2025 at 09:45 UTC

    perl 5, version 40, subversion 2 (v5.40.2) built for x86_64-linux-thread-multi

    HTML::Tree 5.07

    I can't reproduce it with the strings only, either. Very frustrating.

    #!/bin/perl
    use strict;
    use utf8;
    use open IO => ':utf8';
    use open ':std', ':utf8';
    use feature 'unicode_strings'; # You get funky results with the sub convertNumberedSequencesToChar without this.
    use HTML::TreeBuilder 5 -weak;

    my $tree = HTML::TreeBuilder->new;
    # HTML given is read from a file in UTF-8 format. Using parse_file returns garbled characters.
    my $html = '<i>adv.</i><br/><b>1</b> <span lang="pt">maldosamente</span><br/><b>2</b> <span lang="pt">maliciosamente</span><br/><b>3</b> <span lang="pt">intencionalmente</span>';
    # The next 2 lines are added to see whether it would impact losing <br/> tags
    $tree->no_space_compacting(1);
    $tree->ignore_unknown(0);
    $tree->parse( $html );
    $tree->eof();
    $tree->dump();
    print( $tree->as_HTML('&<>') );

    results in:

    <html><head></head><body><i>adv.</i><br /><b>1</b> <span lang="pt">maldosamente</span><br /><b>2</b> <span lang="pt">maliciosamente</span><br /><b>3</b> <span lang="pt">intencionalmente</span></body></html>

    Parsing the whole file also shows the <br/> tags. However, after adding attributes, detaching and undefining the attributes, the tags are gone.

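    # attr() is not an lvalue: pass the new value as the second argument; undef removes the attribute.
    $Node->attr( "size_of_contents", scalar( $Node->content_list ) );
    $Node->detach();
    $Node->attr( "size_of_contents", undef );
    $Node->dump();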

    No <br/> tags found anymore. I am sure I am missing something.

      The size shouldn't matter, but the contents might. Try removing parts of the file and check whether the behaviour persists. Aim for the smallest possible file that still shows the symptoms.
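
      A minimal sketch of such a reduced test, assuming the attribute is set and removed through the two-argument form of attr(). The <div> wrapper, the shortened markup and the variable names are hypothetical stand-ins, not taken from the real file:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;

# Hypothetical reduced input: one wrapper element around a <br/>.
my $tree = HTML::TreeBuilder->new;
$tree->parse('<div><i>adv.</i><br/><b>1</b></div>');
$tree->eof;

# Pick the node that will be annotated and detached.
my $node = $tree->look_down( _tag => 'div' );

# Store the number of child items as an attribute.
$node->attr( 'size_of_contents', scalar $node->content_list );

# Cut the node loose from its parent; its own children stay attached to it.
$node->detach;

# Passing undef as the value removes the attribute again.
$node->attr( 'size_of_contents', undef );

# Count the <br> elements still present below the detached node.
my $br_count = () = $node->look_down( _tag => 'br' );
print "br elements after detach: $br_count\n";
print $node->as_HTML('<>&'), "\n";
```

      If the count drops to zero with your real data, keep shrinking the input until the difference from this sketch becomes visible.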

        Found it! I was looking at reloaded intermediate data! Normally, reloading is turned off when testing is on, but... I wrote those lines late last night and forgot about them.

        Thank you very much for the help. I wouldn't have found it if I hadn't written tests after your replies showed that TreeBuilder wasn't responsible.