in reply to transforming html

It sounds like HTML::TreeBuilder isn't handling the UTF-8 encoding correctly. However, I think there's an easy solution, hinted at by the documentation for HTML::TreeBuilder and open.

HTML::TreeBuilder:

$root = HTML::TreeBuilder->new()
$root->parse_file(...)
An important method inherited from HTML::Parser, which see. Current versions of HTML::Parser can take a filespec, or a filehandle object, like *FOO, or some object from class IO::Handle, IO::File, IO::Socket, or the like. I think you should check that a given file exists before calling $root->parse_file($filespec).
Ok, so it accepts file handles? Good...

open:

open(my $fh, "<:encoding(UTF-8)", "filename") || die "can't open UTF-8 encoded filename: $!";
Ok, so we can specify which encoding to use when we open a file? Hmm!
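
To see what the encoding layer changes, here is a minimal sketch using only core modules. The temporary file and its contents are made up for illustration; the point is that the same bytes come back as bytes without a layer and as characters with :encoding(UTF-8).

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the UTF-8 bytes for "cafe" with an accented e (0xC3 0xA9) to a temp file.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;                       # write raw bytes, no layers
print {$out} "caf\xC3\xA9";
close $out or die "close failed: $!";

# Read it back as raw bytes: the accented character occupies two bytes.
open my $raw, '<:raw', $path or die "can't open $path: $!";
my $bytes = <$raw>;
close $raw;

# Read it back through the UTF-8 decoding layer: now it is one character.
open my $fh, '<:encoding(UTF-8)', $path or die "can't open $path: $!";
my $chars = <$fh>;
close $fh;

printf "raw length: %d, decoded length: %d\n", length($bytes), length($chars);
```

A parser that is handed the decoded filehandle sees characters rather than bytes, which is presumably why the entities come out right afterwards.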

So here's what I'd try.

open my $fh, "<:encoding(UTF-8)", $yourOriginalFileName or die "can't open $yourOriginalFileName: $!";
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file($fh);

Untested, but hopefully it helps.

Edit: made a weird mistake in my code (as well as in the links). Fixed, I hope.

Re^2: transforming html
by morgon (Priest) on Sep 29, 2010 at 02:01 UTC
    Great - many thanks!

    Opening the file with the proper encoding and passing the filehandle to parse_file indeed results in the proper entities being used - which is good enough for me.

    For extra points:

    Can you think of a way to make as_HTML emit the UTF-8 characters as they appear in the source document, without replacing them with HTML entities?

      I'm not sure why you'd want this, as in the end it renders pretty much the same, but I suppose you have your reasons.

      After digging through the documentation a bit, I finally found that as_HTML() is defined on HTML::Element and those docs don't really hint at a way to prevent that encoding from happening.

      But we don't let them discourage us that easily, do we? So after diving into the source of HTML::Element and having a look at the code of the as_HTML subroutine, I learned that the entities encoding is handled by HTML::Entities.

      sub as_HTML {
          # Bla bla bla
          # Your typical subroutine initial stuff we don't care much about...
          if ( ... ) {
              # Some condition I don't really understand since I didn't bother
              # to understand the initial stuff above. But it didn't seem too
              # relevant.
              # A whole lot of stuff happens here, seemingly all dealing with
              # tags, not with text.
          }
          else {    # it's a text segment
              # Hey! Cool.
              # One more line of bla bla bla, before...:
              HTML::Entities::encode_entities( $node, $entities )
              # Yeah, this sounds about right. Let's look at that.
          }
          # More stuff I didn't bother to look at...
      }

      Ok, so HTML::Entities is our target now. There's no apparent way to disable entity encoding so we'll have to use the source as our documentation again. *Shrug*, whatever, it's way past bedtime anyway now so I might as well see what I can do.

      # HTML::Entities
      # First there's a whole lot of POD here, but since I already saw the HTML
      # version of that (which wasn't very helpful) I don't really care.

      # Hey, cool. The actual module begins here.
      use strict;
      use vars qw(@ISA @EXPORT @EXPORT_OK $VERSION);
      use vars qw(%entity2char %char2entity);
      # Bla bla bla. Oh, wait, that last line looks promising.
      # Some more stuff for Exporter happens next. I don't care.

      %entity2char = (
          # What follows is a long, long, long mapping of character names
          # to actual characters.
          # This list goes on and on and on... Never knew there were so many!
      );

      # Then, suddenly:
      # Make the opposite mapping
      while (my($entity, $char) = each(%entity2char)) {
          $entity =~ s/;\z//;
          $char2entity{$char} = "&$entity;";
      }
      delete $char2entity{"'"};  # only one-way decoding
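
      The "opposite mapping" loop can be tried in isolation. This is only a sketch: the three-entry table is a hypothetical stand-in for the real, much longer %entity2char, and the variables are lexicals rather than the module's package globals.

```perl
use strict;
use warnings;

# A tiny, hypothetical stand-in for the real %entity2char table.
# Keys may or may not carry a trailing semicolon, hence the s/;\z// below.
my %entity2char = (
    'amp'     => '&',
    'lt'      => '<',
    'eacute;' => "\x{e9}",
);

# Build the reverse mapping the same way the quoted loop does.
my %char2entity;
while ( my ( $entity, $char ) = each %entity2char ) {
    $entity =~ s/;\z//;                   # drop a trailing semicolon, if any
    $char2entity{$char} = "&$entity;";    # store the full "&name;" form
}

print "$char2entity{'<'}\n";    # prints "&lt;"
```

      Whatever ends up in %char2entity is what the encoder looks up when it decides how to write a character out.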

      He, he, he. I think we win. Just one line should, theoretically, keep this whole mean machine from replacing your characters with the HTML entities. It's a bit of a shame, since the original authors of this module went to such pains to first set up one mapping (which is really a handful of pages long) and then to invert that mapping, but well, they should've made entity-encoding optional in the first place. Just one line, I think (although again it's untested).

      %HTML::Entities::char2entity = (); # Bye bye.

      Addendum: for completeness' sake, you'd put this line somewhere before you begin printing. Something like this should do the trick.

      %HTML::Entities::char2entity = (); # Bye bye.
      open my $fh, ">:encoding(UTF-8)", "out.html" or die $!;
      print $fh "<html><body>"
          . join("\n", map { $_->as_HTML } ($tit, $sub, $aut, $art))
          . "</body></html>";
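
      A possibly simpler route, if I'm reading the HTML::Element docs correctly: as_HTML takes an optional first argument naming the characters to entity-encode, so passing '<>&' should leave accented characters alone. Also, as far as I can tell from the HTML::Entities source, encode_entities falls back to numeric character references for characters missing from %char2entity, so emptying the hash may only trade named entities for numeric ones. A minimal sketch, assuming HTML::TreeBuilder is installed; the sample markup is made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# A small in-memory document; \x{e9} and \x{e8} are accented e characters.
my $tree = HTML::TreeBuilder->new_from_content(
    "<p>caf\x{e9} &amp; cr\x{e8}me</p>"
);

my $default = $tree->as_HTML;           # encodes everything considered unsafe
my $minimal = $tree->as_HTML('<>&');    # encodes only <, > and &

$tree->delete;                          # free the tree explicitly

binmode STDOUT, ':encoding(UTF-8)';     # the second string holds real characters
print "default: $default\n";
print "minimal: $minimal\n";
```

      Limiting the set to '<>&' keeps the markup-significant characters escaped, which clobbering the whole table would not do.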
        Again, many thanks.

        And I really think the way you answered my questions is exceptional - not just providing a final answer but illustrating what you did to attack the problem (making me feel a little bit guilty I did not put that much effort in myself).

        Really useful.