UTF-8 Decoding, Wide Characters, and XML::Twig

thedoe has asked for the wisdom of the Perl Monks concerning the following question:

I have a weekly process that takes some XML input files, parses them for specific data, and writes the results to a fixed-width file. Some of these files are quite large (>100 MB), so I use XML::Twig to parse them. The script was no problem at all until I was told that the program which consumes my output files cannot handle UTF-8. This was throwing the fixed width off, as that program would see two odd characters in place of each UTF-8 character. One other caveat is that the script must run on a Windows box (request from boss, non-negotiable).

One solution I found was to set the encoding of the output file. I tried `open OUT, ">:encoding(ascii):crlf", 'test.txt';`. The crlf is there because of Windows, and the encoding is ASCII since that is all the downstream program can handle. The problem is that whenever it came across a non-ASCII character it would print a warning (e.g. "\x{00e9}" does not map to ascii) and the output file would contain the literal text \x{00e9}. This is unacceptable for both length and readability.

After looking and reading around a bit more, I found the Encode module. Using the from_to function seemed to decode the UTF-8 I was getting into ASCII exactly how I needed. A "?" was substituted for non-ASCII characters. That solution was perfect for readability and maintained the length.
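
A minimal sketch of the from_to approach described above, assuming the input is a raw UTF-8 byte string (the sample value is hypothetical, not taken from the actual data). With no CHECK argument, Encode substitutes "?" for anything that has no ASCII equivalent, so the character count is preserved.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Hypothetical input: "cafe" with an e-acute, as raw UTF-8 bytes
# (the e-acute is the two bytes C3 A9).
my $octets = "caf\xC3\xA9";

# Convert in place from UTF-8 to ASCII. With no CHECK argument,
# unmappable characters are replaced by the substitution character '?'.
from_to($octets, 'utf8', 'ascii');

print "$octets\n";            # one '?' per non-ASCII character
print length($octets), "\n";  # same number of characters as the original text
```

Note that from_to expects bytes, not an already-decoded Perl character string.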

Then, while parsing one of the files, I ran into the error "Cannot decode string with wide characters at Encode.pm line 186.". When looking at the input file through a UTF-8 capable editor (EditPlus), the character in question seems to be a simple space. When I copy+paste this text and run from_to on it directly, there is no problem. This has led me to wonder if XML::Twig::Ent's field function is returning an odd character here. I have no idea how to test this theory out though.

My questions are: has anyone run into this type of error before? If so, how did you solve it? If not, how do you convert from UTF-8 to ASCII?

Thanks all in advance!


Replies are listed 'Best First'.
Re: UTF-8 Decoding, Wide Characters, and XML::Twig by samtregar (Abbot) on Feb 03, 2006 at 18:57 UTC
"If not, how do you convert from UTF-8 to ASCII?"

The last time I had to do this I was feeding XML data to a Glimpse search indexer. Glimpse doesn't support UTF-8 (or at least, it didn't then), but I needed it to answer queries in UTF-8, so just dropping the non-ASCII characters or replacing them with ? wasn't an option. To solve the problem I converted UTF-8 to UTF-7, a Unicode encoding that uses only 7-bit ASCII characters. Then I did the same thing with the search strings before sending them to Glimpse for matching. It worked great for me; perhaps you can do the same thing.

-sam
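
A rough sketch of the UTF-7 idea, assuming the "UTF-7" encoding that ships with the Encode module; the search term is a made-up example and this is not necessarily how the original Glimpse setup was coded.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical search term containing one non-ASCII character (e-acute).
my $term = "caf\x{e9} au lait";

# UTF-7 represents every character using 7-bit ASCII bytes only.
my $utf7 = encode('UTF-7', $term);
print "$utf7\n";

# The conversion is reversible, unlike replacing characters with '?'.
my $back = decode('UTF-7', $utf7);
print "round trip ok\n" if $back eq $term;
```

The encoded form is longer than the original text, so this suits indexing and searching better than fixed-width output.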
Re: UTF-8 Decoding, Wide Characters, and XML::Twig by rhesa (Vicar) on Feb 03, 2006 at 19:17 UTC
I use Text::Unidecode for this, but I should add that I don't care about fixed width. It's a great module for readability (since it translates e.g. "é" to "e", which I found preferable to replacing every high character with a ?), but I believe it translates e.g. a Japanese character (one UTF-8 character) to something like "wa"¹. This might screw up your alignment.

I know Unicode defines several different kinds of spaces (non-breaking or breaking, zero-width, half-width, em and en spaces, just to name some off the top of my head). It's entirely possible that the UTF-8-to-ASCII translation misses one of these. Depending on the input, you might try a simple regexp like `s/\s/ /g` before translating to ASCII, although I don't know exactly which Unicode whitespace characters are covered by \s.

¹ I'm not at all familiar with Japanese. I just mentioned it for illustrative purposes. The Text::Unidecode pod has more detailed (and more accurate!) examples.
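
A small sketch of the whitespace normalization suggested above, using a hypothetical field value that contains an em space (U+2003) and a thin space (U+2009). It assumes the text is already a decoded Perl character string; on such a string \s matches both of these characters, so they become plain spaces before the remaining non-ASCII characters are replaced by "?".

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical field value: accented letters plus an em space (U+2003)
# and a thin space (U+2009), all as decoded characters.
my $field = "Jos\x{e9}\x{2003}Mart\x{ed}n\x{2009}Jr";

# Turn every whitespace character, including the Unicode ones, into
# an ordinary ASCII space.
(my $clean = $field) =~ s/\s/ /g;

# Encode to ASCII; characters with no ASCII equivalent become '?'.
my $ascii = encode('ascii', $clean);

print "$ascii\n";
print length($field), " characters in, ", length($ascii), " bytes out\n";
```

Each character maps to exactly one output byte, so the field width does not change.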
Re: UTF-8 Decoding, Wide Characters, and XML::Twig by graff (Chancellor) on Feb 04, 2006 at 05:44 UTC
Given that you are stuck with using MS Windows, and given that the non-ASCII data really is single-byte-per-character (and needs to be kept that way), you should be writing your fixed-width file using one of the "legacy" MS character encodings. Which one you need depends on which language is being used in the data: what are the non-ASCII characters? If they are just the basic Latin-1 set for "Western" languages (French, German, Spanish), then you probably want CP1252. Or it could just be that the non-ASCII characters are those nefarious "smart quotes" and other bothersome punctuation marks being foisted on us all, in which case any CP12?? charset will do.

You can use binmode on your output file handle to impose the conversion from perl-internal utf8 to cp1252 (or whatever); this way, no information is lost, 8-bit characters remain 8-bit characters, and the fixed-width lines get the right byte count: `binmode OUTFH, ":encoding(cp1252)";`

(Someday, the boss might get the idea that the downstream process that needs the fixed-width file as input ought to accommodate utf8 data, and at that point you'll need to take out the binmode line, or maybe just replace ":encoding(cp1252)" with ":utf8".)
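
To illustrate the byte count, here is a minimal sketch that applies the same binmode layer to an in-memory filehandle so the resulting bytes can be inspected. The field value and the width of 10 are hypothetical; a real script would open the actual output file instead of a scalar buffer.

```perl
use strict;
use warnings;

# Hypothetical record field: "cafe" with e-acute (U+00E9), as decoded characters.
my $name = "caf\x{e9}";

# Write to an in-memory buffer so the resulting bytes can be inspected.
my $buf = '';
open my $out, '>', \$buf or die "cannot open in-memory handle: $!";
binmode $out, ':encoding(cp1252)';

# Pad the field to a fixed width of 10 characters.
printf {$out} '%-10s', $name;
close $out;

# Show the byte count and a hex dump of what was written.
printf "%d bytes: %s\n", length($buf),
    join ' ', map { sprintf '%02X', ord } split //, $buf;
```

The e-acute is written as the single byte E9, so a 10-character field is exactly 10 bytes. Characters that have no CP1252 equivalent would still trigger the "does not map" warning described in the original question.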
Re: UTF-8 Decoding, Wide Characters, and XML::Twig by thedoe (Monk) on Feb 03, 2006 at 20:35 UTC
The reason I did not post code is that when I tested the actual text, the program worked just fine. I therefore wanted to see if someone else had run into this before I spent hours doing some serious debugging. I will be working up some test cases later to see if I can reproduce this error. This drove me nuts and, if there is a problem with either of these modules, I would like to nail down exactly what it is and either fix it or at least report it. In the meantime, while the clock is ticking at work, rhesa's response seems to have solved the problem. How and why, I wish I knew, and I will have to work on that in my own time. Thanks, everyone!
Re: UTF-8 Decoding, Wide Characters, and XML::Twig by john_oshea (Priest) on Feb 04, 2006 at 13:58 UTC
Depending on the origin of your XML file, there's a possibility that what appears to be a 'simple space' may, in fact, be one of the Unicode space characters with values above U+00FF. A hex dump of the offending portion of the file may reveal more here.
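
One way to test this theory, sketched under the assumption that the value coming back from the parser is already a decoded Perl character string (typical for XML::Parser-based modules, but not verified against the original script). The sample text is hypothetical. It lists the code points of all non-ASCII characters, shows that from_to dies on such a string, and shows that encode, which takes characters rather than bytes, does not.

```perl
use strict;
use warnings;
use Encode qw(encode from_to);

# Hypothetical value: decoded characters, including an em space
# (U+2003), which lies above U+00FF.
my $text = "foo\x{2003}bar";

# List the code point of every non-ASCII character to see what the
# apparent "space" really is.
my @odd = map { sprintf 'U+%04X', ord } grep { ord > 0x7F } split //, $text;
print "non-ASCII: @odd\n";

# from_to() expects bytes; on decoded text containing a wide
# character the decoding step dies.
my $copy = $text;
my $ok   = eval { from_to($copy, 'utf8', 'ascii'); 1 };
my $err  = $ok ? '' : $@;
print "from_to died: $err" unless $ok;

# encode() takes characters and substitutes '?' for anything that
# has no ASCII equivalent.
my $ascii = encode('ascii', $text);
print "$ascii\n";
```

If this matches what happens in the real script, a copy and paste of the text into a test script would not reproduce the error, because the pasted literal is a byte string rather than decoded characters.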
Re: UTF-8 Decoding, Wide Characters, and XML::Twig by mirod (Canon) on Feb 03, 2006 at 19:08 UTC
Can you post some minimal code, example data and expected output so I (we) can figure out exactly what the problem is? A test case (with Test::More for example) would be even better. As the saying (should) go: a test case is worth a thousand words ;--)