in reply to Re: Losing <br/> tags after parsing with HTML::TreeBuilder
in thread Losing <br/> tags after parsing with HTML::TreeBuilder

perl 5, version 40, subversion 2 (v5.40.2) built for x86_64-linux-thread-multi

HTML::Tree 5.07

I can't reproduce it with the strings only, either. Very frustrating.

#!/bin/perl
use strict;
use utf8;
use open IO => ':utf8';
use open ':std', ':utf8';
use feature 'unicode_strings'; # You get funky results with the sub convertNumberedSequencesToChar without this.
use HTML::TreeBuilder 5 -weak;

my $tree = HTML::TreeBuilder->new;

# HTML given is read from a file in UTF-8 format. Using parse_file returns garbled characters.
my $html = '<i>adv.</i><br/><b>1</b> <span lang="pt">maldosamente</span><br/><b>2</b> <span lang="pt">maliciosamente</span><br/><b>3</b> <span lang="pt">intencionalmente</span>';

# The next 2 lines are added to see whether it would impact losing <br/> tags
$tree->no_space_compacting(1);
$tree->ignore_unknown(0);
$tree->parse( $html );
$tree->eof();
$tree->dump();
print( $tree->as_HTML('&<>') );

results in:

<html><head></head><body><i>adv.</i><br /><b>1</b> <span lang="pt">maldosamente</span><br /><b>2</b> <span lang="pt">maliciosamente</span><br /><b>3</b> <span lang="pt">intencionalmente</span></body></html>

Parsing the whole file also shows the <br/> tags. However, after adding an attribute, detaching the node, and undefining the attribute again, the tags are gone.

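# attr() is not an lvalue; the value is passed as the second argument
$Node->attr( "size_of_contents", scalar( $Node->content_list ) );
$Node->detach();
# passing undef as the value deletes the attribute
$Node->attr( "size_of_contents", undef );
$Node->dump();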

No <br/> tags found anymore. I am sure I am missing something.

Re^3: Losing <br/> tags after parsing with HTML::TreeBuilder
by choroba (Cardinal) on Jun 05, 2025 at 09:49 UTC
    The size shouldn't matter, but the contents might. Try removing parts of the file and checking whether the behaviour persists, until you have the smallest possible file that still shows the symptoms.
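
    One possible shape for such a reduced test case, as a rough sketch only: it counts the <br> elements under a node before and after the attr/detach sequence from the post above. The shortened HTML string, the choice of the body element as the node, and the variable names are assumptions made for illustration; they are not taken from the original program.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;

# Shortened sample input (assumption: one <br/> is enough to show the effect).
my $html = '<i>adv.</i><br/><b>1</b> <span lang="pt">maldosamente</span>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof();

# Assumption: the node under test is the implicit <body> element.
my $node = $tree->look_down( _tag => 'body' );

# look_down in list context returns every match; "() =" turns that into a count.
my $br_before = () = $node->look_down( _tag => 'br' );

# Same sequence as in the post: set an attribute, detach, delete the attribute.
$node->attr( 'size_of_contents', scalar( $node->content_list ) );
$node->detach();
$node->attr( 'size_of_contents', undef );

my $br_after = () = $node->look_down( _tag => 'br' );

print "br before: $br_before, br after: $br_after\n";
```

    If the two counts match, the attr/detach sequence itself is not what drops the tags, and the cause has to be somewhere else in the program.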


      Found it! I was looking at reloaded intermediate data! Normally, that reloading is turned off when testing is on, but I wrote those lines late last night and forgot about them.

      Thank you very much for the help. I wouldn't have found it if your replies showing that TreeBuilder wasn't responsible hadn't prompted me to write tests.