Re: Handling variety of languages/Unicode characters with Spreadsheet::ParseExcel

Re^2: Handling variety of languages/Unicode characters with Spreadsheet::ParseExcel by richb (Scribe) on Apr 09, 2010 at 12:13 UTC
Thanks for the suggestion. I created a small test spreadsheet with two entries:

    Fundación
    ФОРСУНОК

The Encoding method returns 1 (8bit ASCII or single byte UTF-16) for the Spanish text and 2 (UTF-16BE) for the Russian text.

I also modified the TextFmt routine in FmtDefault.pm to print the value of the parameter $sCode. It was undef for the Spanish text and UTF16-BE for the Russian text. So the routine just returns the Spanish text unchanged, since $sCode is undef, but formats the Russian text (which gets mangled) as UTF16-BE.

    sub TextFmt($$;$) {
        my($oThis, $sTxt, $sCode) = @_;
        if((! defined($sCode)) || ($sCode eq '_native_')) {
            print STDERR "$sTxt/sCode " . (defined($sCode) ? "is _native_" : "undefined") . " - returning text\n";
            return $sTxt;
        };
        # Handle utf8 strings in newer perls.
        if ($] >= 5.008) {
            require Encode;
            print STDERR "$sTxt/$sCode; returning text with UTF-16BE encoding\n";
            return Encode::decode("UTF-16BE", $sTxt);
        }
        print STDERR "$sTxt/$sCode; formatting with pack/unpack\n";
        return pack('U', unpack('n', $sTxt));
        #return pack('C', unpack('n', $sTxt));
    }
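One way to narrow down where the Russian text gets mangled is to look at code points rather than at printed glyphs. The following is only a sketch: the byte string is typed in by hand as a stand-in for what the parser would pass to TextFmt, and `codepoints` is a hypothetical debugging helper, not something provided by Spreadsheet::ParseExcel.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: hand-written UTF-16BE bytes for the Cyrillic test word,
# standing in for the raw cell data.
my $raw = pack 'n*',
    0x0424, 0x041E, 0x0420, 0x0421, 0x0423, 0x041D, 0x041E, 0x041A;

# Same decode call that TextFmt uses on perl 5.8 and later.
my $text = decode('UTF-16BE', $raw);

# Hypothetical helper: list the code point of each character in a string.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $str;
}

# Code points are plain ASCII, so they can be inspected on any terminal.
print STDERR codepoints($text), "\n";

# Give STDOUT an encoding layer before printing the characters themselves.
binmode(STDOUT, ':encoding(UTF-8)');
print "$text\n";
```

If the code points come out in the Cyrillic block (U+0400 to U+04FF), the decode step is probably fine and the damage is more likely happening on output, for example when the filehandle has no encoding layer.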
Re^3: Handling variety of languages/Unicode characters with Spreadsheet::ParseExcel by graff (Chancellor) on Apr 09, 2010 at 13:30 UTC
"The Encoding method returns 1 (8bit ASCII or single byte UTF-16) for the Spanish text"

I don't know where that description is coming from, but it suggests a serious misunderstanding of the terms being used. You can say "single-byte ASCII" (which is redundant, since ASCII by definition uses only 7 bits), but it's strange to say "8bit ASCII", because ASCII does not refer to values in the 0x80-0xFF range, and people usually speak of "8-bit characters" as being in contrast to ASCII (because 8-bit characters are the ones in the range 0x80-0xFF).

Saying "single byte UTF16" is simply nonsensical. It's an oxymoron.
Re^4: Handling variety of languages/Unicode characters with Spreadsheet::ParseExcel by richb (Scribe) on Apr 09, 2010 at 14:26 UTC
I copied the text in parentheses from the page you linked to, which explains the encoding method's return values:

    The encoding() method returns one of the following values:

    * 0: Unknown format. This shouldn't happen. In the default case the format should be 1.
    * 1: 8bit ASCII or single byte UTF-16. This indicates that the characters are encoded in a single byte. In Excel 95 and earlier This usually meant ASCII or an international variant. In Excel 97 it refers to a compressed UTF-16 character string where all of the high order bytes are 0 and are omitted to save space.
    * 2: UTF-16BE.
    * 3: Native encoding. In Excel 95 and earlier this encoding was used to represent multi-byte character encodings such as SJIS.
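To make the quoted return values concrete, here is a rough sketch of how raw cell bytes could be turned into Perl character strings depending on the encoding() code. It assumes the Excel 97 reading of code 1 ("compressed UTF-16" with the zero high bytes dropped), in which case every byte is a code point in the range U+0000 to U+00FF and can be decoded as ISO-8859-1. `decode_cell_bytes` is a hypothetical helper, and the byte strings are typed in by hand for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Spreadsheet::ParseExcel): decode raw
# cell bytes according to the code returned by encoding().
sub decode_cell_bytes {
    my ($bytes, $encoding_code) = @_;

    # 2: full UTF-16BE, two bytes per code unit.
    return decode('UTF-16BE', $bytes) if $encoding_code == 2;

    # 1: one byte per character. Assuming the Excel 97 case, each byte is
    # the low byte of a UTF-16 code unit whose high byte was 0, which is
    # the same mapping as ISO-8859-1.
    return decode('ISO-8859-1', $bytes) if $encoding_code == 1;

    # 0 and 3 depend on a native code page this sketch does not know
    # about, so the bytes are returned unchanged.
    return $bytes;
}

my $spanish = decode_cell_bytes("Fundaci\xF3n", 1);         # Fundación
my $russian = decode_cell_bytes("\x04\x24\x04\x1E", 2);     # first two letters: ФО

binmode(STDOUT, ':encoding(UTF-8)');
print "$spanish $russian\n";
```

For Excel 95 and earlier files, code 1 may instead mean a regional single-byte code page, so the ISO-8859-1 branch would not be safe there.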
Re^5: Handling variety of languages/Unicode characters with Spreadsheet::ParseExcel by graff (Chancellor) on Apr 09, 2010 at 14:46 UTC