Re: How to get undecoded html entities with HTML::TreeBuilder

TIMTOWTDI for sure :-) One - since decoding/encoding is reversible:

#!/usr/bin/perl

use warnings;
use strict;

use HTML::TreeBuilder;

my $html = <<'HTML';
<a href="http://www.nowhere.com/?action=a1&amp;param=p1">Some text</a>
HTML

my $tree = HTML::TreeBuilder->new_from_content( $html );

for my $elt ($tree->look_down ('_tag', 'a')) {
    print "\nA " . HTML::Entities::encode($elt->attr ('href')) . "\n\n";
}

Others might involve hacking the modules you use.

--shmem

_($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                              /\_¯/(q    /
----------------------------  \__(m.====·.(_("always off the crowd"))."·
");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}

Re^2: How to get undecoded html entities with HTML::TreeBuilder by naikonta (Curate) on May 18, 2007 at 01:42 UTC
I was deceived by this snippet. I expected an `undefined subroutine &HTML::Entities::encode` error, but in fact it ran OK and printed `A http://www.nowhere.com/?action=a1&param=p1`. I then took a look at the HTML::TreeBuilder source and found that HTML::Entities is indeed use'd there. I rechecked its manual and found no match for the keywords `entity`, `entities`, `decode`, or `encode`. It might also be worth noting that HTML::Entities is part of the HTML-Parser distribution.

Open source softwares? Share and enjoy. Make profit from them if you can. Yet, share and enjoy!
Re^2: How to get undecoded html entities with HTML::TreeBuilder by Krambambuli (Curate) on May 20, 2007 at 09:42 UTC
"One - since decoding/encoding is reversible:"

Unfortunately, this isn't an option. The raw HTML may well contain both encoded and unencoded entities. Since my code will eventually be part of a filter, I really need the raw, original text so that I can substitute it when and how appropriate.

As to hacking the source modules: I didn't succeed, and I'm not even sure I can, as the decoding is done by HTML::Parser in its C code. HTML::Parser offers a way to get the raw attributes ($p->attr_encoded sets a boolean flag when a new parser is built), but I haven't yet found a way to use it via HTML::TreeBuilder.

Thank you, anyway.
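
For what it's worth, here is a minimal sketch of the attr_encoded route using HTML::Parser directly, without HTML::TreeBuilder, so no tree is built. Whether the same flag can be switched on for a TreeBuilder object is not checked here. The second link (with a bare `&`) and the `@raw_hrefs` array are made up for illustration. The sketch also runs each raw value through a decode/encode round trip, to show where that approach stops matching the original text.

```perl
#!/usr/bin/perl

use warnings;
use strict;

use HTML::Parser ();
use HTML::Entities qw(decode_entities encode_entities);

# Illustrative input: one href with an encoded ampersand, one with a bare one.
my $html = <<'HTML';
<a href="http://www.nowhere.com/?action=a1&amp;param=p1">encoded</a>
<a href="http://www.nowhere.com/?action=a2&id=2">bare</a>
HTML

my @raw_hrefs;    # hypothetical name, collects the untouched attribute values

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ($tagname, $attr) = @_;
            push @raw_hrefs, $attr->{href}
                if $tagname eq 'a' && defined $attr->{href};
        },
        'tagname, attr',
    ],
);

# Leave entities in attribute values alone instead of decoding them.
$p->attr_encoded(1);

$p->parse($html);
$p->eof;

for my $raw (@raw_hrefs) {
    # Decode then re-encode, to compare against the raw attribute text.
    my $round_trip = encode_entities( decode_entities($raw) );
    print "raw:        $raw\n";
    print "round trip: $round_trip\n";
    print "identical:  ", ( $raw eq $round_trip ? "yes" : "no" ), "\n\n";
}
```

With this input the first href should survive the round trip unchanged, while the second should not, since the bare `&` comes back as `&amp;`. That is the mixed case described above, and it is why a filter that needs the original text cannot rely on re-encoding.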