Re: UTF-8 entities in XML/HTML?

Ã¤

This encodes two characters, not one, so it's certainly not what you want.

To avoid getting something like that, decode your strings with Encode::decode and then apply utf8::upgrade to the string. (That last step may not be necessary if all you want is to entity-encode the string.)
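As a rough illustration of the decode-then-entity-encode advice, here is a minimal sketch. It assumes the input arrives as raw UTF-8 octets; the helper name entity_encode and the sample string are made up for this example and are not from the original thread.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: replace every non-ASCII character with a
# hexadecimal numeric character reference for its code point.
sub entity_encode {
    my ($s) = @_;
    $s =~ s/([^\x00-\x7F])/sprintf('&#x%X;', ord($1))/ge;
    return $s;
}

# Assumption: these are the raw UTF-8 octets for U+00E4, as they
# would come from a file or socket without any decoding.
my $bytes = "\xC3\xA4";

# Without decoding, each octet is treated as its own character.
my $wrong = entity_encode($bytes);      # &#xC3;&#xA4;

# Decode the octets into a character string first.
my $text  = decode('UTF-8', $bytes);
my $right = entity_encode($text);       # &#xE4;

print length($bytes), " octets, ", length($text), " character\n";
print "undecoded: $wrong\n";
print "decoded:   $right\n";
```

The only difference between the two results is whether the string was decoded before the entities were generated; the entity-encoding step itself is the same in both cases.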


Replies are listed 'Best First'.

Re^2: UTF-8 entities in XML/HTML?
by Anonymous Monk on Sep 03, 2008 at 15:37 UTC

Juerd wrote:
> Are you sure your data is properly *decoded* when
> you read it from file/socket/database?

Thanks for answering, Juerd. The script reads it from an RSS file, and I have just double-checked: if the bytes are separately encoded, like Ã¤, XML::RSS decodes it correctly. How do I know it's correct? Well, I am able to read the character displayed in the HTML output. (In the HTML source it's unencoded, but since Firefox thinks the encoding is UTF-8, which is actually set via the HTML header, it displays the character as expected.)

Perhaps interesting regarding the test I just ran: if I replace one or all instances of the separate-byte entities with the (supposedly correct) single code point version, I get this error in Apache's log: Wide character in print. And what I see in the browser are little squares that contain tiny hex numbers, e.g. C3 and A4.

If all entities are separate-bytes encoded, there is no error.
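For reference, Perl emits "Wide character in print" when a string containing a character above U+00FF is printed to a filehandle that has no encoding layer. Whether that is exactly what happens in the script described above is an assumption, but the two usual ways of avoiding the warning can be sketched as follows. The sample string is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample: a character string with a code point above U+00FF.
my $text = "\x{20AC}";    # EURO SIGN

# Option 1: encode to UTF-8 octets explicitly, then print the octets.
my $octets = encode('UTF-8', $text);
print $octets, "\n";

# Option 2: put an encoding layer on the handle so that every
# subsequent print of a character string is encoded automatically.
binmode(STDOUT, ':encoding(UTF-8)');
print $text, "\n";
```

Either option should be used on its own for a given handle. Printing already-encoded octets through an encoding layer would encode them a second time.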

--
moritz wrote:
> ...This encodes two characters, not one,
> so it's certainly not what you want.

Thanks for your answer! Great, this confirms my finding.

I will try out what you suggested (Encode::decode + utf8::upgrade).

Jot
