Markismus has asked for the wisdom of the Perl Monks concerning the following question:
I am extracting keyword and definition pairs from a large HTML document with HTML::TreeBuilder and HTML::Element. After parsing the HTML string and dumping the nodes, I find that the <br/> tags are missing.
How could I prevent that?
use HTML::TreeBuilder 5 -weak;
my $tree = HTML::TreeBuilder->new;
# HTML given is read from a file in UTF-8 format.
# Using parse_file returns garbled characters.
my $html = shift;
# The next 2 lines are added to see whether it would impact losing <br/> tags
$tree->no_space_compacting(1);
$tree->ignore_unknown(0);
$tree->parse( $html );
$tree->eof();
...
...
$Definition = $DefinitionNode->as_HTML('<>&');
Part of the HTML-string input:
<i>adv.</i><br/><b>1</b> <span lang="pt">maldosamente</span><br/><b>2</b> <span lang="pt">maliciosamente</span><br/><b>3</b> <span lang="pt">intencionalmente</span>
And the resulting output:
<i>adv.</i><b>1</b> <span lang="pt">maldosamente</span><b>2</b> <span lang="pt">maliciosamente</span><b>3</b> <span lang="pt">intencionalmente</span>
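For anyone who wants to reproduce this, a minimal self-contained sketch might look like the following. It assumes the shortened input string below is representative of the real document, and it serializes the implicit body element rather than the original $DefinitionNode, whose selection code is elided above. It only reports how many br elements end up in the tree; it does not claim to explain why they go missing.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;

# Shortened sample input (assumption: representative of the real document).
my $html = '<i>adv.</i><br/><b>1</b> <span lang="pt">maldosamente</span>'
         . '<br/><b>2</b> <span lang="pt">maliciosamente</span>';

my $tree = HTML::TreeBuilder->new;
$tree->no_space_compacting(1);
$tree->ignore_unknown(0);
$tree->parse($html);
$tree->eof;

# Count the br elements that are present in the parsed tree.
my @breaks = $tree->look_down( _tag => 'br' );
printf "br elements in tree: %d\n", scalar @breaks;

# Serialize the implicit body element the same way as in the question.
my $body = $tree->find_by_tag_name('body');
print $body->as_HTML('<>&'), "\n";
```

Comparing the reported count with the serialized output shows whether the tags are lost during parsing or only when calling as_HTML.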
Replies are listed 'Best First'.
Re: Losing <br/> tags after parsing with HTML::TreeBuilder
by choroba (Cardinal) on Jun 05, 2025 at 09:04 UTC
by 1nickt (Canon) on Jun 05, 2025 at 09:38 UTC
by Markismus (Acolyte) on Jun 05, 2025 at 09:45 UTC
by choroba (Cardinal) on Jun 05, 2025 at 09:49 UTC
by Markismus (Acolyte) on Jun 05, 2025 at 11:11 UTC