Azih has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm reading in a text file with characters like é on a Windows XP machine using ActivePerl. The problem is that ActivePerl doesn't do what I expect when it reads the file in.

The output goes to an MS-DOS command prompt window.

The code I'm using is this:

open(PARTY_FILE, "../Data/test.txt");
while ($file_line = <PARTY_FILE>) {
    print "$file_line\n";
}
The file test.txt consists of this line:
Québécois
If test.txt is encoded in ANSI, my code spits out this output:

QuΘbΘcois

If test.txt is encoded in UTF-8, my code spits out this output:

Québécois

If test.txt is encoded in Unicode, my code spits out this output:

■Q u Θ b Θ c o i s

Code tags added by Arunbear; (update: removed code tags around output)

Replies are listed 'Best First'.
Re: Reading text file with French characters
by graff (Chancellor) on Oct 15, 2006 at 19:54 UTC
    First, I'm impressed that you were able to convey the display contents of the MSDOS-Prompt window -- thanks for that.

    (Update: After code tags were added to "tidy things up", it seems the nice DOS glyphs are gone. Too bad... maybe the janitors can restore the earlier form, which I thought was quite clear.) (thanks, Arunbear!)

    Second, in order to display your text correctly in the MSDOS-Prompt window, the encoding you need to use is the one called cp437. Just convert your text to that encoding, and it should look just fine.
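A minimal sketch of that conversion, assuming the "ANSI" file holds CP1252 bytes and using the core Encode module. The sample string stands in for the contents of test.txt, and the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the "ANSI" file holds CP1252 bytes, so "é" is the single byte 0xE9.
my $ansi_bytes = "Qu\xE9b\xE9cois";

# Decode the file bytes into Perl characters, then re-encode them for the console.
my $text        = decode('cp1252', $ansi_bytes);
my $cp437_bytes = encode('cp437', $text);

# Show the converted bytes in hex; cp437 stores "é" as 0x82.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $cp437_bytes;
print "$hex\n";   # 51 75 82 62 82 63 6F 69 73
```

Printing $cp437_bytes itself (rather than the hex dump) to a cp437 console should then show the accented letters as intended.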

    It seems like you have a good understanding of what it means to convert text data to different encodings for output, and your different renderings of "Québécois" make sense, given that they are being viewed with a cp437-based display tool.

    For ISO-8859-1, CP1252 and Unicode, the numeric code for "é" is 0xE9. When expressed in UTF-16LE, that becomes the two-byte sequence "\xE9\x00" (the 16-bit value 0x00E9, low byte first); when converted to UTF-8, it becomes the two-byte sequence "\xC3\xA9" (perlunicode explains why this is so, in the section titled "Unicode Encodings", about halfway down).
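Those byte sequences can be checked directly. The following is a small sketch using Encode; the hash name %hex_for is just an illustrative choice.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $e_acute = "\x{E9}";   # the character U+00E9, "é"

# Encode the same character four ways and record the resulting bytes in hex.
my %hex_for;
for my $encoding ('iso-8859-1', 'cp1252', 'UTF-16LE', 'UTF-8') {
    my $bytes = encode($encoding, $e_acute);
    $hex_for{$encoding} = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
    printf "%-10s %s\n", $encoding, $hex_for{$encoding};
}
# iso-8859-1 and cp1252 give E9, UTF-16LE gives E9 00, UTF-8 gives C3 A9
```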

    Also, your conversions to Unicode have caused the "byte-order mark" (BOM) to be included at the beginning of the string. The BOM is code point 0xFEFF; in UTF-16LE, that's "\xFF\xFE", and in UTF-8, it's "\xEF\xBB\xBF".
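As a rough illustration, the sketch below encodes the BOM both ways and then strips it from a decoded string. It assumes the file was saved as UTF-8 with a leading BOM; hex_bytes is a hypothetical helper written only for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $bom         = "\x{FEFF}";
my $bom_utf16le = hex_bytes(encode('UTF-16LE', $bom));   # FF FE
my $bom_utf8    = hex_bytes(encode('UTF-8', $bom));      # EF BB BF
print "UTF-16LE BOM: $bom_utf16le\n";
print "UTF-8 BOM:    $bom_utf8\n";

# Assumption: test.txt was saved as UTF-8 with a leading BOM.
my $utf8_bytes = "\xEF\xBB\xBFQu\xC3\xA9b\xC3\xA9cois";
my $text       = decode('UTF-8', $utf8_bytes);

# After decoding, the BOM is the single character U+FEFF; drop it if present.
$text =~ s/\A\x{FEFF}//;
printf "%d characters after removing the BOM\n", length $text;   # 9
```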

    You can look up those various byte values in the mapping table for cp437: http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP437.TXT
    and you'll understand why those encodings of the word look the way they do in the MSDOS-Prompt window. (Note: that window tends to display null bytes as spaces.)
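Instead of reading the table by hand, the same lookup can be sketched with Encode by decoding the raw bytes as if they were cp437. This assumes Encode's cp437 table matches the mapping file above; the labels and hash names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes as they appear in each version of the file.
my %samples = (
    'ANSI e-acute'  => "\xE9",
    'UTF-8 e-acute' => "\xC3\xA9",
    'UTF-16LE BOM'  => "\xFF\xFE",
);

# Interpret each byte as cp437 and report the code points a cp437 display would draw.
my %seen_as;
for my $label (sort keys %samples) {
    my $shown = decode('cp437', $samples{$label});
    $seen_as{$label} = join ' ', map { sprintf 'U+%04X', ord($_) } split //, $shown;
    print "$label: $seen_as{$label}\n";
}
```

U+0398 is Θ and U+25A0 is ■, which matches the output in the question. Byte 0xFF maps to U+00A0, a no-break space, which would explain why only the ■ of the UTF-16LE BOM is visible.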

Re: Reading text file with French characters
by ForgotPasswordAgain (Vicar) on Oct 15, 2006 at 16:09 UTC

    You forgot to show an example of the code you're using.

    See perldoc PerlIO for how to open a file with whatever encoding.
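A minimal sketch of that approach, using the `:encoding(...)` I/O layer on both input and output. It assumes the file is CP1252 and the console is cp437; a temporary file stands in for ../Data/test.txt so the example is self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in for ../Data/test.txt: a temporary file holding CP1252 ("ANSI") bytes.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "Qu\xE9b\xE9cois\n";
close $tmp_fh or die "Cannot close $tmp_name: $!";

# Decode on input: the layer turns the file's bytes into Perl characters.
open(my $party_file, '<:encoding(cp1252)', $tmp_name)
    or die "Cannot open $tmp_name: $!";

# Encode on output: assumes the command prompt window uses cp437.
binmode(STDOUT, ':encoding(cp437)');

my @lines;
while (my $file_line = <$party_file>) {
    chomp $file_line;
    push @lines, $file_line;
    print "$file_line\n";
}
close $party_file;
```

For the UTF-8 or UTF-16LE versions of the file, only the layer name in the open call would need to change; a leading BOM may still have to be handled separately.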

Re: Reading text file with French characters
by Joost (Canon) on Oct 15, 2006 at 16:25 UTC