It's already been covered that it should be
open my $fh, "<:raw:encoding(utf-16):crlf:utf8", $file or die "$file: $!";
or more precisely,
open my $fh, "<:raw:encoding(ucs-2le):crlf:utf8", $file or die "$file: $!";
read($fh, my $bom='', 1);
And no, it doesn't work. Not if the data contains any non-ASCII characters, at least, but that's the whole point of this exercise. The UTF8 flag gets turned off, so the UTF-8 encoding of the characters is treated as iso-latin-1.
For example, if a field contains <"é">, Text::CSV_XS returns the two characters <Ã©> instead of <é>. (I'm using angled brackets to quote to avoid confusion with the double-quotes in the CSV file.)
For example, if a field contains <"♠">, Text::CSV_XS returns three characters, <â> followed by U+0099 and U+00A0 (the UTF-8 bytes E2 99 A0 read as iso-latin-1), instead of <♠>.
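The effect can be reproduced without Text::CSV_XS. Here is a minimal sketch that uses only the core Encode module to mimic what happens when the UTF8 flag is lost. The variable names are mine and are not part of any API.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# "é" is U+00E9; its UTF-8 encoding is the two bytes C3 A9.
my $char  = "\x{E9}";
my $bytes = encode('UTF-8', $char);

# Reading those same bytes as iso-latin-1 yields two characters,
# U+00C3 ("Ã") and U+00A9 ("©"), which is the mojibake described above.
my $misread = decode('iso-8859-1', $bytes);

printf "%d byte(s), misread as %d character(s): U+%04X U+%04X\n",
    length($bytes), length($misread), map { ord } split //, $misread;
```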
The flag needs to be reinstated, so it should be:
use Text::CSV_XS;
use Encode qw( _utf8_on );

my $csv = Text::CSV_XS->new ({ binary => 1 });

# UTF-16 or UCS-2 file with BOM and CRLF or LF line endings.
open my $fh, "<:raw:encoding(utf-16):crlf:utf8", $file or die "$file: $!";

while (my $row = $csv->getline ($fh)) {
    # Fix inability of CSV_XS to handle UTF8 strings.
    _utf8_on($_) for @$row;
    print $row->[4];
}
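To see what _utf8_on does to a single field, here is a small sketch on a standalone string, again using only Encode. The comparison with decode('UTF-8', ...) is my own addition and is not something the code above relies on. As far as I know, _utf8_on only flips the flag and does not check that the bytes are valid UTF-8, whereas decode does validate them.

```perl
use strict;
use warnings;
use Encode qw( encode decode _utf8_on );

# Simulate a returned field: the UTF-8 bytes of "♠" (E2 99 A0), UTF8 flag off.
my $field = encode('UTF-8', "\x{2660}");

# Reinstate the flag on a copy; the underlying bytes are left untouched.
my $flagged = $field;
_utf8_on($flagged);

# Alternative (assumption: the field really holds UTF-8): decode the bytes.
my $decoded = decode('UTF-8', $field);

printf "flag off: %d chars, flag on: %d char, same as decode: %s\n",
    length($field), length($flagged), ($flagged eq $decoded ? "yes" : "no");
```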
There is at least one other problem with treating characters encoded using UTF-8 no differently than characters encoded using iso-latin-1, as Text::CSV_XS does.
If any of eol, sep_char, etc. is passed a string with the UTF8 flag off and it contains a character in [\x80-\xFF], Text::CSV_XS can generate false positives. However, this is unlikely to ever happen.
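The kind of false positive meant here can be illustrated at the byte level. This sketch does not call Text::CSV_XS. It only shows how a byte-wise comparison can match where a character-wise one does not. The separator and the field contents are hypothetical values chosen for the demonstration.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical separator: "é" as the single iso-latin-1 byte E9, UTF8 flag off.
my $sep = "\xE9";

# A field holding U+9999, whose UTF-8 encoding is the bytes E9 A6 99.
my $field_chars = "\x{9999}";
my $field_bytes = encode('UTF-8', $field_chars);

# Character-wise, the separator does not occur in the field...
my $char_pos = index($field_chars, $sep);

# ...but byte-wise, the first byte of the field looks like the separator.
my $byte_pos = index($field_bytes, $sep);

print "character comparison: $char_pos, byte comparison: $byte_pos\n";
```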
"Text::CSV might soon be extended with a layer that deals with encodings"
I don't see the point, since Text::CSV doesn't open any file handles. All it needs to do is respect the UTF8 flag on strings it receives via getline, eol, sep_char, etc. Currently (well, as of 0.34, and presumably 0.45 too), it ignores that flag.
In reply to Re^3: CSV nightmare (utf8 w/ csv_xs) by ikegami, in thread CSV nightmare by lorenzov