I fixed the xml sample you posted (because what you posted was not well-formed xml), and I made a couple of adjustments to your script. Here's the fixed xml:
<?xml version="1.0" encoding="UTF-8"?>
<foo>
  <definition>
    <property name="irrelevant"></property>
  </definition>
  <definition>
    <property name="youaretheoneiwant">
      <![CDATA[
<!doctype html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Some UTF-8 characters hereí</title>
</head>
<body>
<div>Some more UTF-8 characters hereč</div>
<div><span>±There could be UTF-8 characters anywhere</span></div>
</body></html>
      ]]>
    </property>
    <property name="idontcareaboutyou">
    </property>
  </definition>
</foo>
Note the addition of a root element around the two "definition" elements. And here's the fixed script:
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use XML::Twig;
use HTML::Entities;
use HTML::Parser;

my $xml = $ARGV[0] or die "Usage: $0 file.xml\n";
#print $xml;

my $twig = XML::Twig->new(
    pretty_print  => 'indented',
    twig_handlers => { '#CDATA' => \&encodeCorrectly });
$twig->parsefile( $xml );
$twig->flush;
exit;

sub encodeCorrectly {
    my( $twig, $property)= @_;
    my $htmlToEncode = $property->text;
    my $htmlEncoded = encode_entities( $htmlToEncode, '&\'"[]\200-\377' );
    # print "\n\n\n" . $htmlEncoded ."\n\n\n";
    $property->set_text( $htmlEncoded );
    # print "\n\n\n" . $property->text ."\n\n\n";
}
Note that I'm invoking the "encodeCorrectly" handler on #CDATA, rather than on the "property" element that contains the CDATA. (Also, I chose to put the xml data into a file, and provide the file name as a command-line arg, so I'm using "parsefile" instead of just "parse".)

(updated the script to fix the "Usage" text, because I'm compulsive about that.)

For reasons that I don't fully understand, there's something special about the way CDATA is being used in your xml snippet - and how that gets "flushed" - so it was XML::Twig's "flush" operation (not your re-encoding function) that was causing the trouble.

I don't know whether that fix will work for your intended/actual data, but it seems to do what you want for the posted/fixed sample data.

BTW, if it's safe to assume that non-ASCII characters never show up as part of the markup (names of tags or attributes) in your xml data, and if it would be sufficient to use numeric character entity references rather than symbolic ones, you could just re-code your data like this:

perl -CS -pe 's/([^[:ascii:]])/sprintf("&#%d;",ord($1))/eg' < orig.xml > encoded.xml
That just replaces every non-ASCII character with the corresponding &#(numeric); entity.
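
If you'd rather have that substitution inside a script, here is a minimal sketch of the same idea as a sub. The name "numeric_entities" is made up for illustration, and it assumes the string you pass in has already been decoded to characters (not raw UTF-8 bytes):

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): replace every non-ASCII
# character with a decimal numeric character reference. ASCII text,
# including all the markup, is left untouched.
# Assumes $text is a decoded character string, not raw UTF-8 bytes.
sub numeric_entities {
    my ($text) = @_;
    $text =~ s/([^[:ascii:]])/sprintf('&#%d;', ord($1))/eg;
    return $text;
}

# \x{ed} is "í" (U+00ED) and \x{10d} is "č" (U+010D)
print numeric_entities("<div>here\x{ed} and here\x{10d}</div>"), "\n";
# prints: <div>here&#237; and here&#269;</div>
```

Since the output is pure ASCII, it doesn't matter what encoding layer is on STDOUT when you print the result.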

ANOTHER UPDATE: Actually, it's not so mysterious why the CDATA part of your xml snippet was causing trouble, and it was your encoding function (not "flush") that caused it: in order for the CDATA thing to work as intended, the square brackets need to be left as-is. If you want to try your original code again, just leave the square brackets out of the "encode_entities" call. Also, please see the HTML::Entities docs for handling Unicode characters above \377 (U+00FF), in case any such happen to occur in your data. (It's not unusual to run into something like U+2019 used as an apostrophe or single quote.) Note that the one-liner above handles all code points outside the ASCII range.
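
For what it's worth, here is a rough sketch of an "encode_entities" call that leaves the square brackets alone and also covers code points above U+00FF. The sub name "encode_keep_brackets" is hypothetical, and the set of "unsafe" characters is my assumption about what you want encoded (ampersand, both quote characters, and everything non-ASCII, with the original range extended up to \x{10FFFF}). It also assumes the input is a decoded character string:

```perl
use strict;
use warnings;
use HTML::Entities;

# Hypothetical wrapper around HTML::Entities::encode_entities.
# The second argument is the body of a regex character class naming the
# characters to encode. Here: & ' " and every code point from U+0080 up.
# Square brackets and angle brackets are deliberately NOT in the set.
sub encode_keep_brackets {
    my ($html) = @_;
    return encode_entities( $html, '&\'"\x{80}-\x{10FFFF}' );
}

# \x{e9} is "é" (U+00E9); the brackets and the markup pass through as-is
print encode_keep_brackets("<![CDATA[ caf\x{e9} ]]>"), "\n";
# prints: <![CDATA[ caf&eacute; ]]>
```

Characters that have a named entity in the module's table come out symbolic (U+2019 becomes &rsquo;, for instance); anything else is emitted as a numeric reference.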


In reply to Re: HTML encoding UTF-8 characters in an HTML block by graff
in thread HTML encoding UTF-8 characters in an HTML block by lilalfyalien
