It's already been covered that it should be

open my $fh, "<:raw:encoding(utf-16):crlf:utf8", $file or die "$file: $!";

or more precisely,

open my $fh, "<:raw:encoding(ucs-2le):crlf:utf8", $file or die "$file: $!";
read($fh, my $bom='', 1);
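To see what that one-character read does, here is a minimal sketch on an in-memory file. The sample content is made up, and I'm assuming UTF-16LE rather than ucs-2le and leaving off the :crlf:utf8 layers to keep it small.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Made-up file content: a BOM, then one CSV line ending in CRLF.
my $bytes = encode('UTF-16LE', "\x{FEFF}a,\x{E9}\r\n");

# Open the bytes as an in-memory file, decoding from UTF-16LE.
open my $fh, "<:raw:encoding(UTF-16LE)", \$bytes or die "in-memory open: $!";

# The layer hands back characters, so a length of 1 consumes the whole BOM.
read($fh, my $bom = '', 1);

# What is left is the data itself, already decoded to characters.
my $line = <$fh>;

printf "BOM is U+%04X, line has %d characters\n", ord($bom), length($line);
```

Without the :crlf layer the CRLF stays in $line, which is why the original open pushes :crlf after the encoding layer.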

And no, it doesn't work. Not if the data contains any non-ASCII characters, at least, but that's the whole point of this exercise. The UTF8 flag gets turned off, so the UTF-8 encoding of the characters is treated as iso-latin-1.

For example, if a field contains <"é">, Text::CSV_XS returns the two characters <Ã©> instead of <é>. (I'm using angle brackets to quote to avoid confusion with the double-quotes in the CSV file.)

For example, if a field contains <"♠">, Text::CSV_XS returns the three characters U+00E2, U+0099 and U+00A0 (the UTF-8 bytes E2 99 A0 read as iso-latin-1) instead of <♠>.
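The mangling can be reproduced with Encode alone. This is only a sketch of the effect, not of what Text::CSV_XS does internally, and misread_as_latin1 is a name I made up for the illustration.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Return the string you get when the UTF-8 bytes of $chars are
# misread as iso-latin-1 (one character per byte).
sub misread_as_latin1 {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);
    return decode('iso-8859-1', $bytes);
}

my $e_acute = misread_as_latin1("\x{E9}");     # é
my $spade   = misread_as_latin1("\x{2660}");   # ♠

# Print the code points rather than the characters themselves.
printf "%d chars: %s\n", length($e_acute),
    join ' ', map { sprintf 'U+%04X', ord } split //, $e_acute;
printf "%d chars: %s\n", length($spade),
    join ' ', map { sprintf 'U+%04X', ord } split //, $spade;
```

This prints 2 chars (U+00C3 U+00A9) for é and 3 chars (U+00E2 U+0099 U+00A0) for ♠.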

The flag needs to be reinstated, so it should be:

use Text::CSV_XS;
use Encode qw( _utf8_on );

my $csv = Text::CSV_XS->new ({ binary => 1 });

# UTF-16 or UCS-2 file with BOM and CRLF or LF line endings.
open my $fh, "<:raw:encoding(utf-16):crlf:utf8", $file or die "$file: $!";

while (my $row = $csv->getline ($fh)) {
    # Fix inability of CSV_XS to handle UTF8 strings.
    _utf8_on($_) for @$row;
    print $row->[4];
}
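To show what reinstating the flag buys you, here is a small sketch that doesn't need Text::CSV_XS. The $field value is a stand-in I built with encode to imitate a field that came back with the flag dropped.

```perl
use strict;
use warnings;
use Encode qw( encode _utf8_on );

# Stand-in for a returned field: the UTF-8 bytes of ♠ with the UTF8 flag off.
my $field = encode('UTF-8', "\x{2660}");

# With the flag off, Perl counts each byte as a character.
my $before = length($field);

# Reinstate the flag; the same bytes are now one character.
_utf8_on($field);
my $after = length($field);

my $is_spade = $field eq "\x{2660}" ? "yes" : "no";
print "before=$before after=$after is_spade=$is_spade\n";
```

Note that _utf8_on does no validation, so it is only safe here because the bytes are known to be well-formed UTF-8.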

There is at least one other problem with treating characters encoded using UTF-8 no differently than characters encoded using iso-latin-1, as Text::CSV_XS does.

If any of eol, sep_char, etc. is passed a string with the UTF8 flag off and it contains a character in [\x80-\xFF], Text::CSV_XS can generate false positives, since that byte can also occur inside the UTF-8 encoding of a character in the data. However, this is unlikely to ever happen.
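A sketch of how I'd expect such a false positive to come about. It only imitates a byte-wise comparison with index and does not call Text::CSV_XS; the separator and the field are made up.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical separator: the single byte 0xA9 (© in iso-latin-1), UTF8 flag off.
my $sep = "\xA9";

# One field holding "café", the way a byte-oriented parser sees it:
# the UTF-8 bytes 63 61 66 C3 A9.
my $field = encode('UTF-8', "caf\x{E9}");

# A byte-wise search finds the separator inside the encoding of é.
my $pos = index($field, $sep);

print $pos >= 0
    ? "false positive at byte offset $pos\n"
    : "no match\n";
```

The field contains no © at all, yet the search matches the second byte of é at offset 4.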

Text::CSV might soon be extended with a layer that deals with encodings

I don't see the point, since Text::CSV doesn't open any file handles. All it needs to do is respect the UTF8 flag on strings it receives via getline, eol, sep_char, etc. Currently (well, 0.34 and presumably 0.45), it ignores it.


In reply to Re^3: CSV nightmare (utf8 w/ csv_xs) by ikegami
in thread CSV nightmare by lorenzov
