I have a weekly process that takes some XML input files, parses them for specific data, and writes the results to a fixed-width file. Some of these files are quite large (>100 MB), so I use XML::Twig to parse them. The script worked without any problems until I was told that the program which consumes my output files cannot handle UTF-8. This threw the fixed-width columns off, because that program saw two odd characters in place of each multi-byte UTF-8 character. One other caveat is that the script must run on a Windows box (request from boss, non-negotiable).

One solution I found was to set the encoding of the output file. I tried using open OUT, ">:encoding(ascii):crlf", 'test.txt';. The crlf layer is there because of Windows, and the encoding is ASCII since that is all the consuming program can handle. The problem is that whenever it came across a non-ASCII character it would emit a warning (e.g. "\x{00e9}" does not map to ascii.) and write the literal text \x{00e9} to the output file. This is unacceptable for both length and readability.

After looking and reading around a bit more, I found the Encode module. Its from_to function seemed to convert the UTF-8 I was getting into ASCII exactly how I needed: a "?" was substituted for each non-ASCII character. That solution was perfect for readability and preserved the field length.
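The difference between those two behaviours can be reproduced with Encode directly. This is only a minimal sketch: the sample string is made up, and it assumes the text is already a decoded character string. FB_PERLQQ is used here because it produces the same \x{00e9} style of escape that showed up in the output file.

```perl
use strict;
use warnings;
use Encode qw(encode FB_PERLQQ LEAVE_SRC);

# Hypothetical sample: "cafe" with an e-acute (U+00E9) as a character string.
my $text = "caf\x{e9}";

# Default CHECK: each unmappable character becomes the substitution
# character "?", so one character in gives one byte out.
my $padded = encode('ascii', $text);

# FB_PERLQQ: each unmappable character becomes a \x{HHHH} escape,
# which changes the length of the field. LEAVE_SRC keeps $text intact.
my $escaped = encode('ascii', $text, FB_PERLQQ | LEAVE_SRC);

printf "%-12s length %d\n", $padded,  length $padded;
printf "%-12s length %d\n", $escaped, length $escaped;
```

For a fixed-width file, only the first form keeps a four-character value at four bytes.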

Then, while parsing one of the files, I ran into the error "Cannot decode string with wide characters at Encode.pm line 186.". When I look at the input file in a UTF-8 capable editor (EditPlus), the character in question appears to be a simple space. When I copy and paste this text into a script and run from_to on it directly, there is no problem. This has led me to wonder whether XML::Twig::Elt's field method is returning an odd character here. I have no idea how to test this theory, though.
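One possible way to test that theory is to dump the code points of the string in question. The sketch below rests on assumptions: code_points is a hypothetical helper, and $field stands in for whatever the field call returns, with U+2003 (EM SPACE) chosen as an example of a character that looks like a space but is not one. It also assumes the value coming back from XML::Twig is already a decoded character string, which would explain why from_to (which expects bytes) dies on it.

```perl
use strict;
use warnings;
use Encode qw(encode from_to);

# Hypothetical helper: list every code point in a string as U+XXXX.
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $str;
}

# Stand-in for a value returned by the parser (an assumption, not real data).
my $field = "a\x{2003}b";

# Anything above U+00FF is a "wide character".
print code_points($field), "\n";
print utf8::is_utf8($field) ? "character string\n" : "byte string\n";

# from_to works in place on bytes, so run it on a copy inside eval.
my $copy = $field;
my $ok = eval { from_to($copy, 'UTF-8', 'ascii'); 1 };
print "from_to failed: $@" unless $ok;

# A character string can be encoded directly; unmappable characters
# become "?" and the length is preserved.
my $ascii = encode('ascii', $field);
print "$ascii\n";
```

If the dump shows code points above U+00FF, the string is already decoded, and calling encode on it instead of from_to would be one way around the error. Pasted text behaves differently because, without use utf8, a literal in the script is a sequence of bytes.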

My questions are: has anyone run into this type of error before? If so, how did you solve it? If not, how do you convert from UTF-8 to ASCII?

Thanks all in advance!


In reply to UTF-8 Decoding, Wide Characters, and XML::Twig by thedoe
