It's already been covered that it should be
open my $fh, "<:raw:encoding(utf-16):crlf:utf8", $file or die "$file: $!";
or more precisely,
open my $fh, "<:raw:encoding(ucs-2le):crlf:utf8", $file or die "$file: $!";
read($fh, my $bom='', 1);  # ucs-2le does not consume the BOM, so read past it
And no, it doesn't work. Not if the data contains any non-ASCII characters, at least, but that's the whole point of this exercise. The UTF8 flag gets turned off, so the UTF-8 encoding of the characters is treated as iso-latin-1.
For example, if a field contains <"é">, Text::CSV_XS returns the two characters <Ã©> instead of <é>. (I'm using angle brackets to quote, to avoid confusion with the double quotes in the CSV file.)
Similarly, if a field contains <"♠">, Text::CSV_XS returns three characters (the three bytes of its UTF-8 encoding, each read as iso-latin-1, shown here as <â£>) instead of the single character <♠>.
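The effect can be reproduced without Text::CSV_XS at all. The following is a minimal sketch, assuming only the core Encode module; `misread_as_latin1` is a hypothetical helper name of mine that encodes a character as UTF-8 and then reads the resulting bytes back as iso-latin-1, which is roughly what happens when the UTF8 flag is lost.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Hypothetical helper: encode a character string as UTF-8, then
# (wrongly) interpret each resulting byte as one iso-latin-1 character.
sub misread_as_latin1 {
    my ($chars) = @_;
    return decode('iso-8859-1', encode('UTF-8', $chars));
}

# U+00E9 (e with acute) is two bytes in UTF-8, so it comes back as two characters.
my $e_acute = misread_as_latin1("\x{E9}");

# U+2660 (black spade suit) is three bytes in UTF-8, so it comes back as three.
my $spade = misread_as_latin1("\x{2660}");

printf "e-acute misread as %d characters: %s\n",
    length($e_acute),
    join ' ', map { sprintf 'U+%04X', ord } split //, $e_acute;

printf "spade misread as %d characters\n", length($spade);
```

Only code points are printed, so the result does not depend on how your terminal is configured.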
The flag needs to be reinstated, so it should be:
use Encode qw( _utf8_on );

my $csv = Text::CSV_XS->new ({ binary => 1 });

# UTF-16 or UCS-2 file with BOM and CRLF or LF line endings.
open my $fh, "<:raw:encoding(utf-16):crlf:utf8", $file
    or die "$file: $!";

while (my $row = $csv->getline ($fh)) {
    # Fix inability of CSV_XS to handle UTF8 strings.
    _utf8_on($_) for @$row;
    print $row->[4];
}
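_utf8_on simply flips the flag and trusts that the bytes are well-formed. If you would rather have that checked, one alternative (my substitution, not something Text::CSV_XS provides) is the core function utf8::decode, which converts in place and returns false on malformed input. Here is a minimal sketch, where `decode_fields` is a hypothetical helper and the row is a hand-built stand-in for what getline returns:

```perl
use strict;
use warnings;

# Hypothetical helper: convert each UTF-8 encoded byte string in a row
# into a character string, in place. Fields that are not well-formed
# UTF-8 are left untouched and reported.
sub decode_fields {
    my ($row) = @_;
    for my $field (@$row) {
        next unless defined $field;
        utf8::decode($field)
            or warn "field is not well-formed UTF-8, left as bytes\n";
    }
    return $row;
}

# Stand-in for a row whose fields came back with the UTF8 flag off:
# the first field holds the UTF-8 bytes C3 A9 rather than U+00E9.
my $row = decode_fields([ "caf\xC3\xA9", "plain" ]);

printf "first field now has %d characters\n", length $row->[0];
```

In the loop above, this would replace the `_utf8_on($_) for @$row;` line with `decode_fields($row);`.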
There is at least one other problem with treating characters encoded using UTF-8 no differently than characters encoded using iso-latin-1, as Text::CSV_XS does.
If any of eol, sep_char, etc. is passed a string with the UTF8 flag off that contains a character in [\x80-\xFF], Text::CSV_XS can generate false positives: that byte can match a byte inside the UTF-8 encoding of an unrelated character. However, this is unlikely to ever happen.
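To illustrate the kind of false positive I mean, here is a minimal sketch that models byte-wise matching with plain index(). It is not Text::CSV_XS's actual code, and the separator \xA9 is chosen purely for the demonstration:

```perl
use strict;
use warnings;
use Encode qw( encode );

# A line that contains U+00E9 but no copyright sign (U+00A9) anywhere.
my $line = "caf\x{E9},bar";

# What a byte-oriented parser sees: U+00E9 has become the bytes C3 A9.
my $line_bytes = encode('UTF-8', $line);

# A separator passed with the UTF8 flag off: the single byte A9.
my $sep = "\xA9";

# Character-wise, the separator does not occur in the line.
my $char_pos = index($line, $sep);

# Byte-wise, it matches the second byte of the encoded U+00E9.
my $byte_pos = index($line_bytes, $sep);

print "character-level match at: $char_pos\n";   # -1, no match
print "byte-level match at:      $byte_pos\n";   # 4, a false positive
```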
"Text::CSV might soon be extended with a layer that deals with encodings"
I don't see the point, since Text::CSV doesn't open any file handles. All it needs to do is respect the UTF8 flag on strings it receives via getline, eol, sep_char, etc. Currently (well, 0.34 and presumably 0.45), it ignores it.