Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: encoding and module
by Joost (Canon) on Jun 09, 2006 at 10:30 UTC
Re: encoding and module
by badaiaqrandista (Pilgrim) on Jun 09, 2006 at 06:01 UTC
Re: encoding and module
by graff (Chancellor) on Jun 10, 2006 at 00:33 UTC
    Single-byte character codes in the range \x80-\x9f are not used for printable characters in any of the ISO-8859 sets (Latin, Greek, Cyrillic, Arabic, Hebrew), and since the first 256 Unicode code points match ISO-8859-1, this range is "unprintable" when converted directly to Unicode: U+0080-U+009F are the C1 control characters. (update: by "converted directly" I mean simply extending each byte to a 16-bit value by adding a null high byte.)
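
To make the point above concrete, here is a minimal sketch (the sample byte is a made-up illustration, not data from the original question). It decodes a byte from the \x80-\x9f range as ISO-8859-1 and shows that the result is a control character rather than anything printable:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: a single byte from the \x80-\x9f range.
my $byte = "\x93";

# Decoding as ISO-8859-1 maps each byte to the code point with the same value.
my $as_latin1 = decode('iso-8859-1', $byte);

# Report the resulting code point and whether Unicode classes it as a control.
printf "U+%04X is %s\n",
    ord($as_latin1),
    ($as_latin1 =~ /\p{Cc}/ ? "a control character" : "printable");
```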

    The various Microsoft code pages tend to use this range for miscellaneous printable characters. The CP125n code pages (n=0..8) mostly assign these codes to the same set of miscellaneous characters -- things like special quote characters and symbols; but earlier MS-DOS code pages (CP8..) tend to use the range in different ways.

    So you need to know something about where the data are coming from in order to know what to do with characters in this range. The standard installation of the Encode module will handle all the DOS/Windows code pages, so if the data are CP125* (which is likely), just pick any of those as the "legacy" encoding in order to convert correctly to Unicode.
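
As a rough sketch of that conversion, assuming the data turn out to be CP1252 (the sample string and the choice of CP1252 are assumptions for illustration; substitute whichever code page your data really use):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical legacy input: curly quotes, an en dash and a euro sign,
# all stored as single bytes in the \x80-\x9f range.
my $bytes = "\x93quoted\x94 \x96 \x80 5";

# Decode from the assumed legacy encoding into Perl's internal character string.
my $text = decode('cp1252', $bytes);

# Show which code point each byte in the range turned into.
printf "U+%04X\n", ord($_) for grep { ord($_) > 0x7f } split //, $text;

# Encode as UTF-8 for output or storage.
my $utf8 = encode('UTF-8', $text);
```

If the output looks wrong (for instance, box-drawing characters where you expected quotes), that is a hint the data came from a different code page than the one assumed here.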