How to get undecoded html entities with HTML::TreeBuilder

Krambambuli has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I'm trying to find a way to parse HTML and keep the parsed HTML identical to the original, raw input.

HTML::TreeBuilder (thanks, GrandFather!) rocks for most of my HTML parsing needs; however, I'd now like to get the HTML undecoded, and so far I can't find a way to do that.

The following code:


#!/usr/bin/perl

use warnings;
use strict;

use HTML::TreeBuilder;

my $html = <<'HTML';
<a href="http://www.nowhere.com/?action=a1&amp;param=p1">Some text</a>
HTML

my $tree = HTML::TreeBuilder->new_from_content( $html );

for my $elt ($tree->look_down ('_tag', 'a')) {
    print "\nA " . $elt->attr ('href') . "\n\n";
}

will print

A http://www.nowhere.com/?action=a1&param=p1

whereas I'd want to get

A http://www.nowhere.com/?action=a1&amp;param=p1

instead. Is there a way... ?

Thank you for any ideas.

Re: How to get undecoded html entities with HTML::TreeBuilder by shmem (Chancellor) on May 17, 2007 at 21:40 UTC
TIMTOWTDI for sure :-) One - since decoding/encoding is reversible:

#!/usr/bin/perl

use warnings;
use strict;

use HTML::TreeBuilder;

my $html = <<'HTML';
<a href="http://www.nowhere.com/?action=a1&amp;param=p1">Some text</a>
HTML

my $tree = HTML::TreeBuilder->new_from_content( $html );

for my $elt ($tree->look_down ('_tag', 'a')) {
    print "\nA " . HTML::Entities::encode($elt->attr ('href')) . "\n\n";
}

Others might involve hacking the modules you use.

--shmem
Re^2: How to get undecoded html entities with HTML::TreeBuilder by naikonta (Curate) on May 18, 2007 at 01:42 UTC
I was deceived by this snippet. I expected to get an `undefined subroutine &HTML::Entities::encode` error, but in fact it ran OK and printed `A http://www.nowhere.com/?action=a1&amp;param=p1`. I then took a look at the HTML::TreeBuilder source and found that HTML::Entities is indeed use'd there. I rechecked the HTML::TreeBuilder manual and found no mention of the keywords `entity`, `entities`, `decode`, or `encode`. It might also be worth noting that HTML::Entities is part of the HTML-Parser distribution.
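Since the snippet above only works because HTML::TreeBuilder happens to load HTML::Entities, a slightly more defensive variant would load the module explicitly and call its documented `encode_entities` function. The following is a minimal sketch under that assumption; `href_encoded` is a hypothetical helper name, not part of either module.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Load HTML::Entities explicitly instead of relying on HTML::TreeBuilder
# having loaded it as a side effect.
use HTML::Entities qw(encode_entities);
use HTML::TreeBuilder;

# href_encoded() is a hypothetical helper for this sketch: it returns the
# href of every <a> element, re-encoded with encode_entities().
sub href_encoded {
    my ($html) = @_;
    my $tree  = HTML::TreeBuilder->new_from_content($html);
    my @hrefs = map { encode_entities( $_->attr('href') ) }
        $tree->look_down( '_tag', 'a' );
    $tree->delete;    # release the tree once we are done with it
    return @hrefs;
}

my $html = '<a href="http://www.nowhere.com/?action=a1&amp;param=p1">Some text</a>';
print "A $_\n" for href_encoded($html);
```

Note that re-encoding reproduces the original attribute only when the input was consistently encoded in the first place.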
Re^2: How to get undecoded html entities with HTML::TreeBuilder by Krambambuli (Curate) on May 20, 2007 at 09:42 UTC
"One - since decoding/encoding is reversible:"

Unfortunately, this isn't an option. It might well be that the raw HTML contains both encoded and unencoded entities. As my code will in the end need to be part of a filter, I really need the raw, original text in order to be able to substitute it when and how appropriate.

As to hacking the source modules - I didn't succeed, and I'm not even sure I can accomplish that, as the decoding is done by HTML::Parser in its C code. HTML::Parser offers a method to get the raw attributes - $p->attr_encoded sets a boolean flag when a new parser is built - but I haven't yet found a way to use it via HTML::TreeBuilder.

Thank you, anyway.
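Because HTML::TreeBuilder is a subclass of HTML::Parser, one approach that may work is to create the tree with `new`, set `attr_encoded` before any parsing happens, and then feed the HTML through `parse`/`eof` instead of `new_from_content` (which builds and parses in one step). The sketch below assumes the inherited flag is honoured by the installed TreeBuilder version; `raw_hrefs` is a hypothetical helper name.

```perl
#!/usr/bin/perl
use warnings;
use strict;

use HTML::TreeBuilder;

# raw_hrefs() is a hypothetical helper for this sketch: it returns the href
# of every <a> element exactly as written in the source.
sub raw_hrefs {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new;

    # attr_encoded is inherited from HTML::Parser; it must be switched on
    # before parsing so that attribute values are left undecoded.
    $tree->attr_encoded(1);

    $tree->parse($html);
    $tree->eof;

    my @hrefs = map { $_->attr('href') } $tree->look_down( '_tag', 'a' );
    $tree->delete;    # release the tree once we are done with it
    return @hrefs;
}

# Mixed input: one link with an encoded ampersand, one without.
my $html = <<'HTML';
<a href="http://www.nowhere.com/?action=a1&amp;param=p1">encoded</a>
<a href="http://www.nowhere.com/?action=a1&param=p1">not encoded</a>
HTML

print "A $_\n" for raw_hrefs($html);
```

The flag only concerns attribute values; text content is still decoded. Serialising such a tree with `as_HTML` will likely encode the attribute values a second time, so this is better suited to reading raw values than to round-tripping the document.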