thedoe has asked for the wisdom of the Perl Monks concerning the following question:
I have a weekly process that takes some XML input files, parses them for specific data, and writes the results to a fixed-width file. Some of these files are quite large (>100 MB), so I use XML::Twig to parse them. The script gave me no trouble until I was told that the program which consumes my output files cannot handle UTF-8. This was throwing the fixed widths off, because that program would see two odd characters in place of each multi-byte UTF-8 character. One other caveat: the script must run on a Windows box (request from boss, non-negotiable).
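To make the width problem concrete, here is a minimal sketch. The value "José" and the 10-character column are made up for illustration and are not from my real data. It only shows that a column padded by character count grows by one byte once it is serialised as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical field value: "Jose" with an e-acute, four decoded characters.
my $name = "Jos\x{e9}";

# Pad to a 10-character column; sprintf counts characters here.
my $field = sprintf '%-10s', $name;

# Serialised as UTF-8, the e-acute takes two bytes, so the column grows.
my $utf8_bytes = encode('UTF-8', $field);

printf "characters: %d\n", length $field;
printf "UTF-8 bytes: %d\n", length $utf8_bytes;
```

A consumer that counts bytes rather than characters would therefore see every later column shifted by one.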
One solution I found was to set the encoding of the output file. I tried open OUT, ">:encoding(ascii):crlf", 'test.txt';. The :crlf layer is there because of Windows, and the encoding is ASCII since that is all the consuming program can handle. The problem is that whenever it came across a non-ASCII character it printed a warning (e.g. "\x{00e9}" does not map to ascii.) and wrote the literal text \x{00e9} to the output file. This is unacceptable for both field length and readability.
After reading around a bit more, I found the Encode module. Its from_to function seemed to convert the UTF-8 I was getting into ASCII exactly how I needed: a "?" was substituted for each non-ASCII character. That solution was perfect for readability and preserved the field length.
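For reference, this is roughly the substitution behaviour described above, as a small sketch. It assumes the input is raw UTF-8 octets (the sample bytes are invented) and relies on Encode's default handling, which substitutes "?" for characters that ASCII cannot represent.

```perl
use strict;
use warnings;
use Encode qw(encode decode from_to);

# Raw UTF-8 octets for "cafe" with an e-acute (the two bytes C3 A9).
my $octets = "caf\xC3\xA9";

# from_to() converts in place and expects octets, not decoded characters.
my $converted = $octets;
from_to($converted, 'UTF-8', 'ascii');

# The same result in two explicit steps: decode octets, then encode characters.
my $chars = decode('UTF-8', $octets);
my $ascii = encode('ascii', $chars);

printf "%s (%d bytes)\n", $converted, length $converted;
printf "%s (%d bytes)\n", $ascii,     length $ascii;
```

Both routes turn the two-byte character into a single "?", so the field keeps its character count.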
Then, while parsing one of the files, I ran into the error "Cannot decode string with wide characters at Encode.pm line 186.". When I look at the input file in a UTF-8 capable editor (EditPlus), the character in question appears to be a simple space. When I copy and paste this text and run from_to on it directly, there is no problem. This has led me to wonder whether XML::Twig::Elt's field function is returning an odd character here. I have no idea how to test this theory, though.
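One way to test the theory is to dump the code points of the string that comes back from the parser. The sketch below uses an invented stand-in value rather than real XML::Twig output, and it assumes the parser hands back already-decoded character strings, which would explain the error: from_to tries to decode first, and that step cannot work on a character above 0xFF such as U+2003 EM SPACE. If that assumption holds, encoding the string directly may be all that is needed.

```perl
use strict;
use warnings;
use Encode qw(encode from_to);

# Hypothetical stand-in for a field value from the parser: already-decoded
# characters, with U+2003 EM SPACE where the editor shows a plain space.
my $field = "Smith\x{2003}Jos\x{e9}";

# List every non-ASCII character by code point to see what is really there.
my @odd = map { sprintf 'U+%04X', ord($_) }
          grep { ord($_) > 127 } split //, $field;
print "non-ASCII: @odd\n";

# from_to() decodes before it encodes; on a string holding a wide character
# that decode step is expected to die, so trap it and report the message.
my $copy = $field;
my $ok   = eval { from_to($copy, 'UTF-8', 'ascii'); 1 };
print "from_to died: $@" unless $ok;

# Encoding the character string directly yields one "?" per non-ASCII character.
my $ascii = encode('ascii', $field);
print "$ascii\n";
```

The exact wording of the trapped error may differ between Encode versions, so the sketch prints whatever message it gets rather than matching on it.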
My questions: has anyone run into this type of error before? If so, how did you solve it? If not, how do you convert from UTF-8 to ASCII?
Thanks all in advance!
Replies are listed 'Best First'.

Re: UTF-8 Decoding, Wide Characters, and XML::Twig
by samtregar (Abbot) on Feb 03, 2006 at 18:57 UTC

Re: UTF-8 Decoding, Wide Characters, and XML::Twig
by rhesa (Vicar) on Feb 03, 2006 at 19:17 UTC

Re: UTF-8 Decoding, Wide Characters, and XML::Twig
by graff (Chancellor) on Feb 04, 2006 at 05:44 UTC

Re: UTF-8 Decoding, Wide Characters, and XML::Twig
by thedoe (Monk) on Feb 03, 2006 at 20:35 UTC

Re: UTF-8 Decoding, Wide Characters, and XML::Twig
by john_oshea (Priest) on Feb 04, 2006 at 13:58 UTC

Re: UTF-8 Decoding, Wide Characters, and XML::Twig
by mirod (Canon) on Feb 03, 2006 at 19:08 UTC

Re: UTF-8 Decoding, Wide Characters, and XML::Twig
by Anonymous Monk on Oct 21, 2014 at 01:46 UTC