in reply to Re^2: converting unicode string to ascii or encoded
in thread converting unicode string to ascii or encoded

I get an error ...

Let me guess: It starts with THIS IS A TOP SECRET ERROR MESSAGE! NEVER POST THIS ERROR MESSAGE ANYWHERE! ESPECIALLY NOT AT PERLMONKS! A KITTEN WILL DIE IF YOU POST IT!.

... when I parse xml in XML::Parser when it gets to a unicode character.

So the XML is likely broken. Did you try to validate it? If the validation fails, the software that generated the XML has a bug. Also try to read the XML using XML::LibXML.
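
A minimal sketch of such a check, assuming XML::LibXML is installed. The helper name xml_error is made up for illustration, and it only tests well-formedness, not validity against a DTD or schema:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: returns an empty string if the XML is well-formed,
# otherwise the parser's error message.
sub xml_error {
    my ($xml) = @_;
    my $doc = eval { XML::LibXML->load_xml(string => $xml) };
    return defined $doc ? '' : "$@";
}

# The sample documents are invented; feed in the real XML instead.
for my $xml ('<a><b>ok</b></a>', '<a><b>oops</a>') {
    my $err = xml_error($xml);
    print $err eq '' ? "well-formed\n" : "broken: $err\n";
}
```

If the real file fails here as well, the error message usually names the line and column of the offending byte.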

Maybe the XML has an unusual encoding? Default is UTF-8, but ISO-8859-1 and Windows-1252 are quite common. Perhaps the XML lacks an explicit encoding declaration, but uses a non-UTF-8 encoding?
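
One rough way to test that theory, sketched with the core Encode module: try a strict UTF-8 decode first and fall back to Windows-1252 if it fails. The helper guess_decode and the fallback choice are assumptions for illustration, not a reliable detector:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: strict UTF-8 first, then Windows-1252 as a fallback.
# Returns the decoded text and the name of the encoding that was used.
sub guess_decode {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps $bytes intact so the fallback can reuse it.
    my $text = eval {
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return ($text, 'UTF-8') if defined $text;
    return (Encode::decode('cp1252', $bytes), 'cp1252');
}

# A lone 0xF6 byte ("ö" in ISO-8859-1 and Windows-1252) is not valid UTF-8.
my ($text, $enc) = guess_decode("K\xF6nig");
print "decoded as $enc\n";   # prints "decoded as cp1252"
```

A successful UTF-8 decode is strong evidence, because text in a single-byte encoding rarely happens to form valid UTF-8 sequences. The fallback, on the other hand, is only a guess.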

Maybe XML::Parser has problems with XML delivered in a non-UTF-8 encoding? There is a clear hint in the documentation that you need to install some extra files for encodings other than UTF-8, ISO-8859-1, UTF-16 and US-ASCII.
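
If the XML turns out to be ISO-8859-1 without a declaration, XML::Parser can be told so through its ProtocolEncoding option. A small sketch with an invented sample document:

```perl
use strict;
use warnings;
use XML::Parser;

# Invented sample: ISO-8859-1 bytes, no encoding declaration (0xF6 is "ö").
my $xml = "<word>K\xF6nig</word>";

my $text = '';
my $parser = XML::Parser->new(
    # One of the built-in encodings, so no extra encoding files are needed.
    ProtocolEncoding => 'ISO-8859-1',
    # The Char handler receives the parser object and a chunk of text.
    Handlers => { Char => sub { $text .= $_[1] } },
);
$parser->parse($xml);

printf "got %d characters\n", length $text;
```

Without the ProtocolEncoding option, the same input should be rejected, because the parser then assumes UTF-8.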

The company I am writing code for wants the unicode characters converted this way.

"Der Kunde ist König." (The customer is king.) But still, this is just stupid. Dropping accents, tildes and other "letter add-ons" can sometimes change the meaning of the text.

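To illustrate with a rough sketch (strip_marks is a made-up helper, and this is only one of several ways to drop accents): decompose the string with the core Unicode::Normalize module and delete the combining marks. German "schön" (beautiful) becomes "schon" (already):

```perl
use strict;
use warnings;
use utf8;                              # this source contains UTF-8 literals
use Unicode::Normalize qw(NFD);

# Hypothetical helper: split letters from their accents (NFD),
# then remove the non-spacing marks.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}//g;
    return $decomposed;
}

binmode STDOUT, ':encoding(UTF-8)';
print strip_marks('schön'), "\n";      # prints "schon"
print strip_marks('König'), "\n";      # prints "Konig"
```
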
Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

Re^4: converting unicode string to ascii or encoded
by dmn001 (Initiate) on Apr 17, 2011 at 16:05 UTC
    That is correct: the source XML is broken because it contains invalid characters, and it cannot even be rendered in Firefox.

    I would post the error message, but it is not really relevant to what I am trying to solve, and it would take me a while to find the code and run it again. In the meantime, I have switched to HTML::TokeParser, as it seems to be less strict about the difference between well-formed XML and a bunch of text with tags in it.

    Thank you for your reply. I will take on board the suggestion to look into different encodings and see if one works.