dmn001 has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I am looking for a Perl module that converts any Unicode characters in a string to their closest equivalent ASCII letters (or, failing that, to their encoded values, for example: &#123;)

my $x = 'Château';

some_func($x);

output: 'Chateau';

Note that the 'â' has turned into a plain letter 'a'. I'm not too worried about losing the accent information, as long as the result is human readable and XML parseable.

Never mind, I found the solution to encode it here: http://perl-xml.sourceforge.net/faq/#encoding_conversion

I guess the solution here is to make a hash of all the Unicode letters and map each one to its equivalent ASCII letter.
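
A hand-built hash is not strictly necessary for accented Latin letters. The following is a minimal sketch, assuming the core module Unicode::Normalize is available: it decomposes each character (NFD), drops the combining marks, and turns whatever is still non-ASCII into a numeric character reference. The function name to_ascii_or_entity is made up for illustration. Letters without a canonical decomposition (for example 'ø' or 'ß') are not transliterated by this approach and end up as references; a CPAN module such as Text::Unidecode may be a better fit if those matter.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: strip accents where possible, otherwise
# emit a decimal numeric character reference such as &#8364;
sub to_ascii_or_entity {
    my ($text) = @_;

    # Decompose: U+00E2 (a with circumflex) becomes "a" + U+0302
    my $s = NFD($text);

    # Drop the combining marks left over from the decomposition
    $s =~ s/\p{Mn}//g;

    # Anything still outside ASCII becomes &#NNN;
    $s =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord $1)/ge;

    return $s;
}

# "\x{E2}" is U+00E2, written as an escape so the source file stays ASCII
print to_ascii_or_entity("Ch\x{E2}teau"), "\n";
```

The input must be a decoded character string (not raw UTF-8 bytes), so decode file or network input first.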

Re: converting unicode string to ascii or encoded
by ikegami (Patriarch) on Apr 14, 2011 at 23:49 UTC
Re: converting unicode string to ascii or encoded
by afoken (Chancellor) on Apr 17, 2011 at 06:55 UTC
    as long as it is human readable, and XML parseable

    Unicode is human readable. XML allows nearly all Unicode characters (XML 1.0 excludes U+0000, most other C0 control characters, the surrogates, and U+FFFE/U+FFFF). What is the real problem?

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
      I get an error when I parse XML with XML::Parser and it reaches a Unicode character.

      The company I am writing code for wants the Unicode characters converted this way.
        I get an error ...

        Let me guess: It starts with THIS IS A TOP SECRET ERROR MESSAGE! NEVER POST THIS ERROR MESSAGE ANYWHERE! ESPECIALLY NOT AT PERLMONKS! A KITTEN WILL DIE IF YOU POST IT!.

        ... when I parse XML with XML::Parser and it reaches a Unicode character.

        So the XML is likely broken. Did you try to validate it? If the validation fails, the software that generated the XML has a bug. Also try to read the XML using XML::LibXML.

        Maybe the XML has an unusual encoding? Default is UTF-8, but ISO-8859-1 and Windows-1252 are quite common. Perhaps the XML lacks an explicit encoding declaration, but uses a non-UTF-8 encoding?

        Maybe XML::Parser has problems with XML delivered in a non-UTF-8 encoding? There is a clear hint in the documentation that you need to install some extra files for encodings other than UTF-8, ISO-8859-1, UTF-16 and US-ASCII.
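
The encoding question above can be checked roughly before blaming the parser. This is only a sketch using the core Encode module; the helper name guess_encoding and its return strings are made up for illustration, and it looks at the raw bytes only, ignoring any encoding declaration in the XML itself.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: classify raw bytes read from an XML file
sub guess_encoding {
    my ($bytes) = @_;

    # Pure 7-bit data is valid in every encoding mentioned above
    return 'US-ASCII' unless $bytes =~ /[^\x00-\x7F]/;

    # Strict decoding dies on malformed UTF-8; work on a copy
    # because decode() with a CHECK value may modify its argument
    my $copy = $bytes;
    my $ok = eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 };

    return $ok ? 'UTF-8' : 'non-UTF-8';
}

print guess_encoding("Ch\xE2teau"), "\n";   # one Latin-1 byte for the accented a
```

If this reports non-UTF-8 for a file that has no encoding declaration, the file is most likely ISO-8859-1 or Windows-1252, and the generator should either declare that or emit UTF-8.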

        The company I am writing code for wants the Unicode characters converted this way.

        "Der Kunde ist König." (The customer is king.) But still, this is just stupid. Dropping accents, tildes and other "letter add-ons" can sometimes change the meaning of the text.

        Alexander
