Jaap has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

Encoding text to save in an XML document is going ok for most chars like this:
my $string = "Weird char: " . chr(0x2a) . ".\n";
$string =~ s/([\x00-\x1f])/sprintf('&#x%02X;', ord($1))/ge;
But a character like the euro sign (0x20ac) is not converted with this.
When I try to use a character like \x1234 in the regexp, Perl complains about it. \x{1234} doesn't work either. Any ideas?

Replies are listed 'Best First'.
Re: Encode text to XML &#x20AC;
by joe++ (Friar) on Sep 27, 2002 at 15:20 UTC
    Hi,

    Your XML parser is expecting UTF-8 by default, and that's exactly what you are using here, so why would you change that? Otherwise there is Text::Iconv which can convert almost any character encoding into what you need.

    Using the euro char has its own particular weirdness in connection with XML as explained here: Euro-XML (xml.com). This has more to do with abusive use of control characters on some widely used platforms than anything else, however.

    --
    Cheers, Joe

      I parse the XML myself.
      The problem is that characters like chr(0x12) are not allowed in XML, so I convert them. But I am unable to convert the longer chr(0x1234) to &#x1234;.

      I am reading the xml.com article. I hope it helps.
Re: Encode text to XML &#x20AC;
by diotalevi (Canon) on Sep 27, 2002 at 16:54 UTC

    Something else to keep in mind is your XML's default encoding. I ran into a problem where XML that was valid according to one system was being rejected by another. It turned out that the original system was assuming the XML was encoded in iso-8859-1, while my system took a default of utf-8. The problem was easily fixed by adding an XML declaration with the correct encoding.

    Default:
    <?xml version='1.0' encoding='utf-8'?>
    
    Correct (for me in this instance):
    <?xml version='1.0' encoding='iso-8859-1'?>
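
One way to keep the declaration and the actual bytes in agreement is to let the Encode module do the conversion. The sketch below is only an illustration: the helper name xml_document_bytes and the <note> element are made up for the example. It relies on Encode's FB_XMLCREF check mode, which writes a numeric character reference for any character the target encoding cannot represent, so a euro sign still survives in an iso-8859-1 document.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: build a tiny document whose bytes match its declaration.
sub xml_document_bytes {
    my ($encoding, $text) = @_;    # $text is a copy, so encode() may consume it
    # FB_XMLCREF turns characters the target encoding cannot represent
    # into numeric character references instead of '?' or a fatal error.
    my $body = encode($encoding, $text, Encode::FB_XMLCREF());
    return "<?xml version='1.0' encoding='${encoding}'?>\n<note>$body</note>\n";
}

# e-acute exists in iso-8859-1 and is written as one byte;
# the euro sign does not, so it comes out as a character reference.
print xml_document_bytes('iso-8859-1', "caf\x{E9} costs \x{20AC}2");
```

Note that this does not escape &, < or >, which would still need handling before the text goes into a real document.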
    
Re: Encode text to XML &#x20AC;
by hiseldl (Priest) on Sep 27, 2002 at 15:22 UTC
    Using an encoder such as MIME::Base64 might work better. For example:
    use MIME::Base64;
    $weird = chr(0x2a);
    $encoded = encode_base64($weird);
    #$decoded = decode_base64($encoded);
    $string = "Weird char: $encoded.\n";

    This is also easier to read and more flexible than a regexp. :)

    --
    hiseldl
    What time is it? It's Camel Time!
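
A caveat on the Base64 route: encode_base64 works on bytes and dies with a "Wide character" error if handed a character above 0xFF such as the euro sign, so the text has to be encoded first. Here is a rough sketch of the round trip, assuming UTF-8 as the byte encoding; the <value encoding='base64'> wrapper is invented for the example, and whatever reads the XML has to know to decode it.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use MIME::Base64 qw(encode_base64 decode_base64);

my $weird = "Euro: " . chr(0x20AC);

# Base64 operates on bytes, so turn the characters into UTF-8 bytes first.
# The second argument '' suppresses the trailing newline encode_base64 adds.
my $encoded = encode_base64(encode('UTF-8', $weird), '');

# The reader reverses both steps: Base64 to bytes, then bytes to characters.
my $decoded = decode('UTF-8', decode_base64($encoded));

# Hypothetical wrapper element; the attribute name is an assumption.
print "<value encoding='base64'>$encoded</value>\n";
print "round trip ok\n" if $decoded eq $weird;
```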

Re: Encode text to XML &#x20AC;
by Jaap (Curate) on Sep 27, 2002 at 16:02 UTC
    After reading http://perl-xml.sourceforge.net/faq/ and 'perldoc perlunicode', I realise that the only problem is not being able to use \x{1234} in a regular expression.
      Found it.
      use utf8;
      After adding this to the top of the script, it is possible to use \x{1234} in a regexp. Thanks for all the help, guys.
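
For later readers, here is a minimal sketch of the substitution with \x{...} ranges added to the character class. It assumes Perl 5.8 or newer, where \x{...} works in a regexp without `use utf8` (that pragma only declares that the script's own source is UTF-8). The helper name xml_escape_chars is made up, and leaving tab, LF and CR alone is a choice made for this example, not something from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: replace control characters and everything above
# ASCII with numeric character references.
sub xml_escape_chars {
    my ($text) = @_;
    # \x00-\x1F minus tab (\x09), LF (\x0A) and CR (\x0D), plus all
    # code points from U+0080 up to U+10FFFF.
    $text =~ s/([\x00-\x08\x0B\x0C\x0E-\x1F\x{80}-\x{10FFFF}])/sprintf('&#x%02X;', ord($1))/ge;
    return $text;
}

my $string = "Weird chars: " . chr(0x12) . " and " . chr(0x20AC) . ".\n";
print xml_escape_chars($string);
```

One caution: XML 1.0 does not accept most control characters even when written as references such as &#x12;, so a strict parser may still reject them. XML 1.1 allows such references for every control character except NUL.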