Re^4: Handling variety of languages/Unicode characters with Spreadsheet::ParseExcel

I copied the text in parentheses from the page you linked to, which explains the encoding method's return values:

The encoding() method returns one of the following values:

    * 0: Unknown format. This shouldn't happen. In the default case the format should be 1.
    * 1: 8bit ASCII or single byte UTF-16. This indicates that the characters are encoded in a single byte. In Excel 95 and earlier This usually meant ASCII or an international variant. In Excel 97 it refers to a compressed UTF-16 character string where all of the high order bytes are 0 and are omitted to save space.
    * 2: UTF-16BE.
    * 3: Native encoding. In Excel 95 and earlier this encoding was used to represent multi-byte character encodings such as SJIS.
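
To make those four return values concrete, here is a minimal sketch of how raw cell bytes could be decoded once the code is known. It is written in Python purely to illustrate the byte handling and is not how Spreadsheet::ParseExcel works internally. The helper name decode_cell and its parameters are hypothetical, and the default codecs for codes 1 and 3 (cp1252 and Shift_JIS) are assumptions, since the code alone does not say which code page a file actually uses.

```python
def decode_cell(raw: bytes, code: int,
                single_byte: str = "cp1252",
                native: str = "shift_jis") -> str:
    """Decode raw cell bytes according to a documented encoding() code.

    Hypothetical helper: `single_byte` and `native` are guesses that the
    caller must supply; the code value alone does not name a code page.
    """
    if code == 1:
        # One byte per character: some single-byte code page.
        return raw.decode(single_byte)
    if code == 2:
        # Two bytes per character, big-endian.
        return raw.decode("utf-16-be")
    if code == 3:
        # Legacy multi-byte "native" encoding, e.g. SJIS.
        return raw.decode(native)
    # Code 0 (or anything else): the docs say this shouldn't happen.
    raise ValueError(f"unknown encoding code: {code}")


latin_text = decode_cell(b"caf\xe9", 1)
utf16_text = decode_cell(b"\x65\xe5\x67\x2c", 2)
sjis_text = decode_cell(b"\x93\xfa\x96\x7b", 3)
```

The point of passing the codec names in as parameters is that codes 1 and 3 only tell you the storage width, so the actual code page has to come from somewhere else (the file's locale, or a guess).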

Replies are listed 'Best First'.

Re^5: Handling variety of languages/Unicode characters with Spreadsheet::ParseExcel
by graff (Chancellor) on Apr 09, 2010 at 14:46 UTC

"I copied the text in parentheses from the page you linked to"

You mean the page that eff_i_g linked to... that's good to know. The full explanation that you quoted for the "* 1." return value is indeed pretty silly. Better to describe that as "single-byte encoding (typically cp1252, but possibly some other single-byte code page)".

The notion of a "compressed UTF-16 character string where all of the high order bytes are 0 and are omitted to save space" is nonsensical. If it made any sense, it would actually be describing ISO-8859-1, whose 256 byte values map one-to-one onto Unicode code points U+0000 through U+00FF. But since Excel is a M$ product, I would expect this to be cp1252 (or similar) instead, which uses code points in the 0x80-0x9f range for various punctuation marks, etc. (whereas ISO-8859-1 has them as "special control characters" that are all non-displayable).
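
For anyone who wants to see the difference at the byte level, here is a small sketch, again in Python only for illustration. The sample bytes are made up for the example. It decodes the same bytes as cp1252 and as ISO-8859-1, then "re-inflates" them by putting a zero high byte in front of each one, which is what the "compressed UTF-16" description would imply.

```python
import unicodedata

# Made-up sample: bytes in the 0x80-0x9f range around plain ASCII "Hi".
raw = bytes([0x93, 0x48, 0x69, 0x94, 0x85])

# cp1252 assigns printable punctuation to these bytes
# (curly quotes and an ellipsis here).
as_cp1252 = raw.decode("cp1252")

# ISO-8859-1 maps every byte straight to the same code point,
# so 0x93, 0x94 and 0x85 become C1 control characters.
as_latin1 = raw.decode("latin-1")

# "Compressed UTF-16": restore the omitted zero high byte of each
# character, then decode as UTF-16BE.
inflated = b"".join(bytes([0, b]) for b in raw)
as_utf16 = inflated.decode("utf-16-be")

# Unicode general category of the first character under each reading:
# "Pi" is initial-quote punctuation, "Cc" is a control character.
first_cp1252_category = unicodedata.category(as_cp1252[0])
first_latin1_category = unicodedata.category(as_latin1[0])

print(repr(as_cp1252), repr(as_latin1), as_utf16 == as_latin1)
```

The re-inflated string comes out identical to the ISO-8859-1 reading, not the cp1252 one, so if a file's single-byte strings really contain cp1252 punctuation, treating them as truncated UTF-16 would turn that punctuation into control characters.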
