    <?xml version="1.0" encoding="UTF-8"?>
    <foo>
    <definition>
    <property name="irrelevant"></property>
    </definition>
    <definition>
    <property name="youaretheoneiwant">
    <![CDATA[
    <!doctype html>
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <title>Some UTF-8 characters hereí</title>
    </head>
    <body>
    <div>Some more UTF-8 characters hereč</div>
    <div><span>±There could be UTF-8 characters anywhere</span></div>
    </body></html>
    ]]>
    </property>
    <property name="idontcareaboutyou">
    </property>
    </definition>
    </foo>

Note the addition of a root element around the two "definition" elements. And here's the fixed script:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;
    use XML::Twig;
    use HTML::Entities;
    use HTML::Parser;

    my $xml = $ARGV[0] or die "Usage: $0 file.xml\n";
    #print $xml;

    my $twig = XML::Twig->new(
        pretty_print  => 'indented',
        twig_handlers => { '#CDATA' => \&encodeCorrectly });
    $twig->parsefile( $xml );
    $twig->flush;
    exit;

    sub encodeCorrectly {
        my( $twig, $property)= @_;
        my $htmlToEncode = $property->text;
        my $htmlEncoded = encode_entities( $htmlToEncode, '&\'"[]\200-\377' );
        # print "\n\n\n" . $htmlEncoded ."\n\n\n";
        $property->set_text( $htmlEncoded );
        # print "\n\n\n" . $property->text ."\n\n\n";
    }

Note that I'm invoking the "encodeCorrectly" handler on #CDATA, rather than on the "property" element that contains the CDATA. (Also, I chose to put the xml data into a file, and provide the file name as a command-line arg, so I'm using "parsefile" instead of just "parse".)
(Updated the script to fix the "Usage" text, because I'm compulsive about that.)
For reasons that I don't fully understand, there's something special about the way CDATA is being used in your xml snippet - and how that gets "flushed" - so it was XML::Twig's "flush" operation (not your re-encoding function) that was causing the trouble.
I don't know whether that fix will work for your intended/actual data, but it seems to do what you want for the posted/fixed sample data.
BTW, if it's safe to assume that non-ASCII characters never show up as part of the markup (names of tags or attributes) in your xml data, and if it would be sufficient to use numeric character entity references rather than symbolic ones, you could just re-code your data like this:

    perl -CS -pe 's/([^[:ascii:]])/sprintf("&#%d;",ord($1))/eg' < orig.xml > encoded.xml

That just replaces every non-ASCII character with the corresponding &#(numeric); entity.
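For use inside a larger script, the same substitution can be wrapped in a small function. This is only a sketch: the name "to_numeric_entities" and the sample string are made up for illustration, and the sprintf format includes the terminating semicolon that a numeric character reference needs.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): replace every character
# outside the ASCII range with a decimal numeric character reference.
sub to_numeric_entities {
    my ($text) = @_;
    $text =~ s/([^[:ascii:]])/sprintf("&#%d;", ord($1))/eg;
    return $text;
}

# Made-up sample: e-acute (U+00E9), c-caron (U+010D) and a right single
# quotation mark (U+2019), written as escapes so that the source file
# itself stays plain ASCII.
my $sample = "caf\x{E9} \x{10D} \x{2019}";
print to_numeric_entities($sample), "\n";
```

Because the output contains only ASCII, it prints cleanly no matter how STDOUT is configured. The input still has to be decoded as UTF-8 first (which is what -CS does for the one-liner), otherwise each byte of a multi-byte character would be encoded separately.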
ANOTHER UPDATE: Actually, it's not so mysterious why the CDATA part of your xml snippet was causing trouble (and actually, it was your encoding function that caused the trouble): in order for the CDATA thing to work as intended, the square brackets need to be left as-is. If you want to try your original code again, just leave the square brackets out of the "encode_entities" call.

Also, please see the HTML::Entities docs for handling Unicode characters above \377 (U+00FF), in case any such happen to occur in your data. (It's not unusual to run into something like U+2019 used as an apostrophe or single-quote.) Note that the one-liner above handles all code points outside the ASCII range.
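To make the difference concrete, here is a rough sketch comparing two "unsafe characters" arguments to encode_entities. The sample string is made up; the first class is the one from the script with the square brackets taken out, and the second is the negated class that I believe the HTML::Entities docs give as an example for encoding everything outside printable ASCII, so check it against the docs before relying on it.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Made-up sample containing brackets, an ampersand, i-acute (U+00ED),
# c-caron (U+010D) and a right single quotation mark (U+2019).
my $html = "<div>[\x{ED}] & \x{10D} \x{2019}</div>";

# The character class from the script, minus the square brackets:
# covers only &, the quote characters and the range \200-\377.
my $latin1_only = encode_entities( $html, '&\'"\200-\377' );

# A negated class: encode everything except newline and the printable
# ASCII characters other than '&'. This also reaches code points
# above U+00FF.
my $all_non_ascii = encode_entities( $html, '^\n\x20-\x25\x27-\x7e' );

# Show which characters are still unencoded after each call.
for my $result ( $latin1_only, $all_non_ascii ) {
    my @left = $result =~ /([^[:ascii:]])/g;
    printf "%d non-ASCII character(s) left\n", scalar @left;
}
```

With the first class, the two characters above U+00FF pass through untouched; with the second, nothing outside ASCII should remain. In both cases the square brackets and the angle brackets are left alone.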
In reply to Re: HTML encoding UTF-8 characters in an HTML block
by graff
in thread HTML encoding UTF-8 characters in an HTML block
by lilalfyalien