in reply to Text File Encoding under Windows

When I print the file in the console, all characters appear separated by a strange extra whitespace.

The file is most likely encoded as UTF-16 (or UCS-2, which for most practical purposes doesn't make much of a difference).  Try to open it with

open my $fh, "<:encoding(UTF-16LE)", ...;
while (<$fh>) {
    ...
}

( :encoding(UTF-16) should work, too, if the file has a BOM (byte order mark), which it typically has. In this case, the BOM itself (\x{feff}) also won't be part of the data read via <$fh>. )
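A minimal sketch of that BOM behavior (the filename is made up for the demo): write a small UTF-16LE file with an explicit BOM, then read it back through :encoding(UTF-16), which detects the byte order from the BOM and strips it.

```perl
use strict;
use warnings;

# Hypothetical demo filename
my $file = 'demo_utf16.txt';

# Write a UTF-16LE file with an explicit BOM (\x{feff});
# :raw keeps the CRLF layer from mangling the 16-bit units on Windows
open my $out, '>:raw:encoding(UTF-16LE)', $file or die $!;
print $out "\x{feff}hello\n";    # BOM, then the actual data
close $out;

# Read it back BOM-aware; the BOM selects the byte order and is consumed
open my $in, '<:encoding(UTF-16)', $file or die $!;
my $line = <$in>;
close $in;
unlink $file;

print $line;    # "hello\n" -- the \x{feff} is not part of the data
```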

Replies are listed 'Best First'.
Re^2: Text File Encoding under Windows
by pat_mc (Pilgrim) on Mar 18, 2010 at 08:33 UTC
    Thanks, almut -

    This solved part of the problem. The file is now getting read in OK and the expected regex matches occur. However, the output is still causing problems because the non-ASCII characters in it do not get represented correctly. I have tried the following two approaches:

    1) Printing out to the DOS console and redirecting the output from there into a file. The result looks fine, except for the special characters, which get represented as EF, FC, etc.

    2) Printing to a UTF-16 encoded file with the following code:
    #! /usr/bin/perl -w
    use strict;
    use locale;

    open INPUT, "<:encoding(UTF-16LE)", $ARGV[0];
    open OUTPUT, ">:encoding(UTF-16LE)", "./Output_UTF-16";

    while ( <INPUT> ) {
        # long list of regex-based replacements
        print OUTPUT $_;
    }
    The result was an output file which represented all special characters correctly but contained a line of empty boxes in every second line.

    Can you please advise what I need to do to fix both output variants?

    Thanks again for your help!
    Pat

      What is the desired (or required) output encoding, i.e. which program are you using to view or further process the output? (can it handle UTF-16?)

      What special characters are involved; may they also be represented in a non-unicode legacy encoding such as ISO-8859-1 (ISO-Latin1) or Windows CP1252?

      Maybe just try other output encodings

      open OUTPUT, ">:encoding(UTF-8)", ...
      open OUTPUT, ">:encoding(CP1252)", ...
      open OUTPUT, ">:encoding(ISO-8859-1)", ...
      ...

      The latter two should be used in combination with :encoding(UTF-16) on the input side, because that swallows the BOM, which you don't want in non-unicode output (in the case of UTF-8 the BOM is optional, so you can decide for yourself).
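A sketch of that suggested combination (the filenames here are invented for the demo): BOM-aware :encoding(UTF-16) on the input side, so the BOM is swallowed, and the legacy CP1252 encoding on the output side.

```perl
use strict;
use warnings;

# Prepare a demo input file: UTF-16LE with a BOM, containing "café"
open my $mk, '>:raw:encoding(UTF-16LE)', 'in_utf16.txt' or die $!;
print $mk "\x{feff}caf\x{e9}\n";
close $mk;

# BOM-aware input, legacy single-byte output (no BOM in the result)
open my $in,  '<:encoding(UTF-16)',     'in_utf16.txt'   or die $!;
open my $out, '>:raw:encoding(cp1252)', 'out_cp1252.txt' or die $!;
while (my $line = <$in>) {
    # regex-based replacements would go here
    print $out $line;
}
close $in;
close $out;
# out_cp1252.txt now holds "café" as plain single bytes, with no BOM
```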

      P.S.: can you view the original input file correctly with the same program that's showing the empty boxes with the output file?  (btw, do you really mean "line" in "a line of empty boxes in every second line", or rather character "column"? — the former would be kinda strange...)

        Thanks, almut -

        This was very helpful! I managed to find out that UTF-8 output encoding in fact worked fine and all the special characters displayed correctly. The application operating on the modified files (whose required encoding I did not know) accepted the input thus created.
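The working combination can be sketched as follows (the filenames are placeholders, not the ones from the original script): BOM-aware UTF-16 in, UTF-8 out.

```perl
use strict;
use warnings;

# Prepare a demo input file: UTF-16LE with a BOM, containing "äöü"
open my $mk, '>:raw:encoding(UTF-16LE)', 'in16.txt' or die $!;
print $mk "\x{feff}\x{e4}\x{f6}\x{fc}\n";
close $mk;

# BOM-aware input, UTF-8 output
open my $in,  '<:encoding(UTF-16)',    'in16.txt' or die $!;
open my $out, '>:raw:encoding(UTF-8)', 'out8.txt' or die $!;
print $out $_ while <$in>;
close $in;
close $out;
```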

        Thanks again for your help! Problem resolved.

        Cheers -

        Pat