in reply to UTF-8 entities in XML/HTML?

ä

This encodes two characters, not one, so it's certainly not what you want.

To avoid getting something like that, decode your strings with Encode::decode and then apply utf8::upgrade to the string. (That last step may not be necessary if all you want is to entity-encode the string.)
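A minimal sketch of the difference, assuming the input arrives as UTF-8 bytes. The helper entity_encode is made up for this illustration and is not part of any module; it simply turns every non-ASCII character into a numeric character reference. Run on the undecoded bytes it emits one entity per byte; run on the decoded string it emits one entity per character.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: replace each non-ASCII character with a
# hexadecimal numeric character reference.
sub entity_encode {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#x%X;', ord $1)/ge;
    return $text;
}

# The UTF-8 encoding of U+00E4, as raw, undecoded bytes.
my $bytes = "\xC3\xA4";

# Without decoding, Perl sees two "characters", one per byte.
my $from_bytes = entity_encode($bytes);

# After decoding, the string holds a single character, U+00E4.
my $chars      = decode('UTF-8', $bytes);
my $from_chars = entity_encode($chars);

print "$from_bytes\n";   # &#xC3;&#xA4;
print "$from_chars\n";   # &#xE4;
```

Both results are pure ASCII, so they can be printed without setting an encoding layer on the output handle.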

See also Perl and Character Encodings, Encode, perluniintro, perlunicode, perlunifaq.

Re^2: UTF-8 entities in XML/HTML?
by Anonymous Monk on Sep 03, 2008 at 15:37 UTC

    Juerd wrote:
    > Are you sure your data is properly *decoded* when
    > you read it from file/socket/database?

    Thanks for answering, Juerd. The script reads it from an RSS file and I have just double-checked: if the bytes are separately encoded, like ä, XML::RSS decodes it correctly. How do I know it's correct? Well, I am able to read the character displayed within the HTML output (in the HTML source it's unencoded, but since Firefox thinks the encoding is UTF-8, which is actually set via the HTML header, it displays the character as expected).

    Perhaps interesting regarding the test I just ran: if I replace one or all instances of the separate-byte entities with the (supposedly correct) single-code version, I get this error in Apache's log: Wide character in print. And what I see in the browser are little squares that contain tiny hex numbers, e.g. C3 and A4.

    If all entities are encoded as separate bytes, there is no error.

    --
    moritz wrote:
    > ...This encodes two characters, not one,
    > so it's certainly not what you want.

    Thanks for your answer! Great, this confirms my finding.

    I will try out what you suggested (Encode::decode + utf8::upgrade).

    Jot