Krambambuli has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I'm trying to find a way to parse HTML and keep the parsed HTML identical to the original, raw input.

HTML::TreeBuilder (thanks, GrandFather!) covers most of my HTML parsing needs; however, I'd now like to get the HTML un-decoded, and so far I can't find a way to do that.

The following code:
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TreeBuilder;

my $html = <<'HTML';
<a href="http://www.nowhere.com/?action=a1&amp;param=p1">Some text</a>
HTML

my $tree = HTML::TreeBuilder->new_from_content( $html );
for my $elt ($tree->look_down ('_tag', 'a')) {
    print "\nA " . $elt->attr ('href') . "\n\n";
}
will print
A http://www.nowhere.com/?action=a1&param=p1
whereas I'd want to get
A http://www.nowhere.com/?action=a1&amp;param=p1
instead. Is there a way... ?

Thank you for any ideas.

Re: How to get undecoded html entities with HTML::TreeBuilder
by shmem (Chancellor) on May 17, 2007 at 21:40 UTC
    TIMTOWTDI for sure :-) One - since decoding/encoding is reversible:
    #!/usr/bin/perl
    use warnings;
    use strict;
    use HTML::TreeBuilder;

    my $html = <<'HTML';
    <a href="http://www.nowhere.com/?action=a1&amp;param=p1">Some text</a>
    HTML

    my $tree = HTML::TreeBuilder->new_from_content( $html );
    for my $elt ($tree->look_down ('_tag', 'a')) {
        # Re-encode the decoded attribute value before printing
        print "\nA " . HTML::Entities::encode($elt->attr ('href')) . "\n\n";
    }

    Others might involve hacking the modules you use.

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
      I was deceived by this snippet. I expected to get an undefined subroutine &HTML::Entities::encode error, but in fact it ran OK and printed A http://www.nowhere.com/?action=a1&amp;param=p1. I then took a look at the HTML::TreeBuilder source and found that HTML::Entities is indeed use'd there. I rechecked the HTML::TreeBuilder manual and found no mention of the keywords entity, entities, decode, or encode. It may also be worth noting that HTML::Entities is part of the HTML-Parser distribution.

      Open source software? Share and enjoy. Make profit from it if you can. Yet, share and enjoy!

      One - since decoding/encoding is reversible:

      Unfortunately, this isn't an option. The raw HTML may well contain both encoded and unencoded entities, so re-encoding the decoded value will not always reproduce the original. As my code will eventually be part of a filter, I really need the raw, original text in order to be able to substitute it when and how appropriate.
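To illustrate the point about mixed input, here is a minimal sketch using decode_entities and encode_entities from HTML::Entities. The helper name round_trip and the sample URL are made up for illustration; they are not from the original post.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use HTML::Entities qw(decode_entities encode_entities);

# Hypothetical helper: decode a raw attribute value, then encode it again.
sub round_trip {
    my ($raw) = @_;
    my $decoded = decode_entities($raw);      # "&amp;" becomes "&"
    my $encoded = encode_entities($decoded);  # every "&" becomes "&amp;"
    return $encoded;
}

# Assumed sample input: one encoded and one bare ampersand.
my $raw = 'p?a=1&amp;b=2&c=3';
my $out = round_trip($raw);

print "raw:        $raw\n";
print "round trip: $out\n";
print "identical:  ", ( $raw eq $out ? "yes" : "no" ), "\n";
```

With this input the round trip yields p?a=1&amp;b=2&amp;c=3, which no longer matches the raw text because the bare ampersand has been encoded as well.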

      As to hacking the source modules - I didn't succeed, and I'm not even sure I can accomplish that, as the decoding is done by HTML::Parser in its C code.

      HTML::Parser offers a method to get the raw attributes - $p->attr_encoded sets a boolean flag when a new parser is built - but I haven't yet found a way to use it via HTML::TreeBuilder.
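For reference, a minimal sketch of the attr_encoded approach using HTML::Parser directly, without HTML::TreeBuilder. It does not build a tree and so does not answer the TreeBuilder question; the helper name raw_hrefs is made up for illustration.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use HTML::Parser;

# Hypothetical helper: return the href values of all <a> tags as written.
sub raw_hrefs {
    my ($html) = @_;
    my @hrefs;

    my $p = HTML::Parser->new(
        api_version => 3,
        # The start handler receives the tag name and a hash of attributes.
        start_h => [
            sub {
                my ($tagname, $attr) = @_;
                push @hrefs, $attr->{href}
                    if $tagname eq 'a' && defined $attr->{href};
            },
            'tagname, attr',
        ],
    );

    $p->attr_encoded(1);    # leave entities in attribute values undecoded
    $p->parse($html);
    $p->eof;

    return @hrefs;
}

my $html = <<'HTML';
<a href="http://www.nowhere.com/?action=a1&amp;param=p1">Some text</a>
HTML

print "A $_\n" for raw_hrefs($html);
```

For the sample input this should print the href with &amp; intact, i.e. A http://www.nowhere.com/?action=a1&amp;param=p1.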

      Thank you, anyway.