How to handle unicode txt file on Windows

cheerful has asked for the wisdom of the Perl Monks concerning the following question:

The files start with the byte order mark (BOM) FFFE or FEFF. I tried something like this:

        my $fh = new FileHandle("< $file");
        if (! $fh) {
            die "failed to open list file '$file': $!";
        }
        my $marker;
        if (2 != read($fh, $marker, 2)) {
            die "Failed to read the first 2 bytes from $file";
        }
        if ($marker eq $UNICODE_FFFE) {
            binmode($fh, ":encoding(utf8)");
        }
        else {
            $fh->seek(0, 0);
        }

But the following read

$line = <$fh>;

still generates a lot of errors.

Printing $line produces letters alternating with spaces.

The script deals with just ASCII text.

1. What's the proper way to detect Unicode in a file?

2. How do I deal with Unicode strings in regular expression matching?

3. Do I need to convert Unicode strings to non-Unicode strings to do string operations, including matching? If so, how?


Replies are listed 'Best First'.
Re: How to handle unicode txt file on Windows by almut (Canon) on Nov 03, 2008 at 17:13 UTC
`FFFE` is the marker for UTF-16LE, not UTF-8... (so, `":encoding(UTF-16LE)"` might work better)
Re^2: How to handle unicode txt file on Windows by ikegami (Patriarch) on Nov 03, 2008 at 20:38 UTC
And `:encoding(UTF-16)` will work even better since it absorbs the BOM.
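To make the suggestion above concrete, here is a minimal sketch of one way to pick the layer from the first two bytes. The helper name `open_text_file` is hypothetical (it does not appear in the thread), and the fallback assumes that a file without a BOM is plain ASCII, as the original question describes.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): open a text file for reading
# and choose the decoding layer from the first two bytes.
sub open_text_file {
    my ($file) = @_;

    # Start with raw bytes so the BOM can be inspected undecoded.
    open(my $fh, '<:raw', $file)
        or die "failed to open '$file': $!";

    # Peek at up to two bytes; a shorter file just yields fewer bytes.
    my $marker = '';
    my $got = read($fh, $marker, 2);
    die "failed to read from '$file': $!" unless defined $got;

    # Rewind so the decoding layer sees the BOM and can consume it.
    seek($fh, 0, 0) or die "failed to rewind '$file': $!";

    if ($marker eq "\xFF\xFE" || $marker eq "\xFE\xFF") {
        # :encoding(UTF-16) uses the BOM to pick the byte order and strips it.
        binmode($fh, ':encoding(UTF-16)');
    }
    # Assumption: no BOM means plain ASCII, so the handle stays as raw bytes.

    return $fh;
}
```

With a handle opened this way, `my $line = <$fh>;` should return decoded characters rather than letters interleaved with NUL bytes. Lines from files written on Windows will still end in "\r\n", so something like `$line =~ s/\r?\n\z//;` may be more suitable than `chomp`.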
Re^3: How to handle unicode txt file on Windows by cheerful (Initiate) on Nov 03, 2008 at 21:36 UTC
What would happen if it's called on a non-unicode file?
Re^4: How to handle unicode txt file on Windows by ikegami (Patriarch) on Nov 03, 2008 at 21:51 UTC
Re: How to handle unicode txt file on Windows by ig (Vicar) on Nov 03, 2008 at 17:47 UTC
You might have a look at perlunitut: Unicode in Perl. Update: and UTF-8 text files with Byte Order Mark has some good pointers.
Re: How to handle unicode txt file on Windows by jethro (Monsignor) on Nov 03, 2008 at 17:26 UTC
To answer 2. and 3.: You don't. Just use the strings, no matter where they came from or what format they were in. The only time you have to do something special is when reading or writing files (and when the script source itself is saved in a UTF encoding). As soon as a string is "inside" perl, you can forget about its encoding.
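As a small illustration of that point, the sketch below decodes a UTF-16 byte string with the core Encode module and then matches it with an ordinary regular expression. The sample data (`key=42`) and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes as they would sit in a UTF-16LE file: BOM, then "key=42"
# with a NUL byte after every ASCII character (illustrative data only).
my $bytes = "\xFF\xFE" . join '', map { "$_\0" } split //, 'key=42';

# Decoding turns the bytes into a character string; 'UTF-16' uses the BOM
# to pick the byte order and drops it from the result.
my $text = decode('UTF-16', $bytes);

# From here on, ordinary string operations and regexes apply.
my ($name, $value) = $text =~ /^(\w+)=(\d+)$/;
print "$name -> $value\n";    # prints "key -> 42"
```

The same applies to lines read through an `:encoding(...)` layer, since the layer performs the decoding step on the way in.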
Re: How to handle unicode txt file on Windows by jplindstrom (Monsignor) on Nov 04, 2008 at 17:49 UTC
Look at File::BOM. /J