Parsing UTF-16LE CSV Records Using Text::CSV*

Jim has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Parsing UTF-16LE CSV Records Using Text::CSV* (TAMWTDI) by tye (Sage) on Jul 20, 2009 at 05:51 UTC
One first step would be to convert from UTF-16LE into a Perl string of characters (which would mean UTF-8, since Latin-1 may not support all of the characters you need). One way to do that would be the Encode module. Another way would be to use an "Input Layer" (see perlunicode for information on converting character encodings using a Layer and lots of other stuff about Unicode in Perl). A third way to do that would be something you likely won't find documentation on, so I'll roll the code for you:

    binmode( CSV );
    sysread( CSV, $bom, 2 );             # Skip the byte-order mark
    $/= pack "v", unpack "c", "\n";      # "\n" as UTF-16LE bytes: "\n\0"
    while( <CSV> ) {
        $_= pack "U*", unpack "v*", $_;  # One character per 16-bit unit
        # Parse the data that is now in UTF-8
    }

But you should probably use a real Unicode-conversion solution like one of the prior two (especially because each new version of Perl changes things related to Unicode, usually changing something so that it now sometimes silently does something different than it used to).

A different first step would be to solve everything in terms of Perl byte strings. That is probably quite straightforward. Just tell Text::CSV to use "binary" mode and give it the separators and such that you already figured out as Perl strings of bytes (it will likely use Text::CSV_XS under the covers).

- tye
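The Encode route mentioned in this reply can be sketched roughly as follows. This is an illustration only, not tye's code: the sample bytes and variable names are made up, and stripping the BOM by hand is one reasonable choice among several.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample record: BOM, "a", U+0014, "b", CRLF, all as UTF-16LE bytes.
my $bytes = "\xFF\xFE" . "a\0" . "\x14\0" . "b\0" . "\r\0\n\0";

# decode() turns the byte string into a Perl character string.
my $text = decode('UTF-16LE', $bytes);

# The 'UTF-16LE' encoding does not consume the BOM, so drop U+FEFF by hand.
$text =~ s/\A\x{FEFF}//;

# Work with characters from here on: strip the line ending, split on U+0014.
(my $line = $text) =~ s/\r\n\z//;
my @fields = split /\x14/, $line;

print scalar(@fields), " fields\n";
```

After decoding, the separator is the single character U+0014 rather than the two-byte sequence `"\x14\0"`.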
Re^2: Parsing UTF-16LE CSV Records Using Text::CSV* (RYO) by tye (Sage) on Jul 20, 2009 at 16:00 UTC
Oops, ikegami points out that the quote, separator, and escape sequences need to be single characters (single bytes?). I suspect they can each be a single UTF-8 character in a modern version of Text::CSV_XS, but I'm not going to verify that. If not, none of my approaches get you to the finish line.

Luckily, parsing a particular example of CSV isn't rocket surgery (sure, please use the module when the module works, as, particularly with this module, it likely already knows some of the edge cases you didn't think of right off and has already been debugged).

    my $w= '\s\0';      # Whitespace (single-quoted so the regex sees \s)
    my $s= "\x14\0";    # Separator
    my $q= "\xFE\0";    # Quote / escape
    my $c= "..";        # Character

    # Versions we can apply modifiers to:
    my $W= "(?:$w)";
    my $S= "(?:$s)";
    my $Q= "(?:$q)";
    my $C= "(?:$c)";

    $/= "\r\0\n\0";
    while( <CSV> ) {
        chomp;          # Drop the record terminator so \z can match
        my @row;
        my $end= 'none';
        while( m{
            \G              # Don't skip any characters
            $W*(?!$w)       # Ignore whitespace outside quotes
            (?:             # Either quoted value or bare value:
                $q(         # Opening quote, keep quoted value as capture 1
                    (?:
                        $q$q        # An escaped quote
                    |
                        (?!$q)$c    # Or a non-quote
                    )*      # Zero or more characters per value
                )$q(?!$q)   # Closing quote (not escaped)
            |
                (           # Capture the bare value as capture 2
                    (?:
                        (?!$s|$q)$c # A non-separator, non-quote character
                    )+      # One or more 'bare' characters
                )(?<!$w)    # Don't backtrack
            |
                ( )         # An empty field as capture 3
            )   # Separate "empty field" case required by the prior lookbehind
            $W* ( $s | \z ) # Track field terminator in capture 4
        }xsg ) {
            $end= $4;
            my( $quoted, $bare, $empty )= ( $1, $2, $3 );
            if( defined $empty ) {
                push @row, undef;   # Or whatever you want 'empty' to mean
            } elsif( defined $quoted ) {
                $quoted =~ s/$q$q/$q/g;
                push @row, $quoted;
            } else {
                push @row, $bare;
            }
            last if '' eq $end;     # Reached \z; don't match an extra empty field
        }
        if( '' ne $end ) {          # We didn't reach \z
            warn "Skipping malformed line ($.).\n";
            next;
        }
        # Do something with @row, which has values as array of byte strings
    }

Note that I haven't debugged this (nor tested it). You'll have to do some work yourself. ;) But the approach seems both sound and robust, IMHO. If your particular CSV doesn't meet my expectations, then adjust the code to suit (since you bothered to mention both "quote" and "escape" characters, it sounds like your CSV will be well-formed and thus can be parsed by code very much like what I have provided).

This approach might even have some advantages speed-wise over other approaches, in that dealing with the variable-sized characters of UTF-8 makes for more complex (slower) code (of course, it isn't XS code either, but then, I value something working over speed by a long shot).

Update: Updated code slightly a few times. Then updated the regex to prevent some backtrack cases that would be a waste of time in the case of a malformed line, and added code to detect a malformed line.

- tye
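The byte strings used above for the separator, the quote, and the record terminator do not have to be typed by hand. As a small sketch (the variable names here are illustrative, not from the post), they can be derived from the one-character delimiters with Encode:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode each delimiter to its UTF-16LE byte sequence.
my $sep_bytes   = encode('UTF-16LE', "\x{14}");  # field separator, U+0014
my $quote_bytes = encode('UTF-16LE', "\x{FE}");  # quote / escape, U+00FE
my $eol_bytes   = encode('UTF-16LE', "\r\n");    # record terminator, CRLF

# Show the bytes in hex for inspection.
print unpack('H*', $_), "\n" for $sep_bytes, $quote_bytes, $eol_bytes;
```

The results match the literals `"\x14\0"`, `"\xFE\0"`, and `"\r\0\n\0"` used in the byte-string parser.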
Re: Parsing UTF-16LE CSV Records Using Text::CSV* by graff (Chancellor) on Jul 20, 2009 at 04:11 UTC
I would strongly suggest that you convert the utf-16 data to perl-internal utf8 before doing anything else with it. You can convert it back to utf-16 again for output if necessary, but there's no reason to use utf-16 encoding for character-based processing. Just set the encoding layer on the input file handle to `":encoding(UTF-16LE)"` and perl will convert the data to utf8 as it reads from the file. Do the same for the output file handle.
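A minimal sketch of the layer approach, assuming a made-up one-record sample and using in-memory file handles so that it is self-contained (a real script would open the CSV files by name instead):

```perl
use strict;
use warnings;

# Hypothetical sample record: BOM, "x", U+0014, "y", CRLF, as UTF-16LE bytes.
my $bytes = "\xFF\xFE" . "x\0" . "\x14\0" . "y\0" . "\r\0\n\0";

# Input: the layer decodes UTF-16LE as the data is read.
open my $in, '<:encoding(UTF-16LE)', \$bytes
    or die "Cannot open input: $!";

my @rows;
while ( my $line = <$in> ) {
    $line =~ s/\A\x{FEFF}// if $. == 1;  # the layer leaves the BOM as U+FEFF
    $line =~ s/\r?\n\z//;                # strip the decoded line ending
    push @rows, [ split /\x14/, $line ];
}
close $in;

# Output: the same layer encodes characters back to UTF-16LE bytes.
my $out_bytes = '';
open my $out, '>:encoding(UTF-16LE)', \$out_bytes
    or die "Cannot open output: $!";
print {$out} join("\x14", @$_), "\r\n" for @rows;
close $out;
```

Between the two handles the program only ever sees characters, so the separator is matched as the single character U+0014.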
Re^2: Parsing UTF-16LE CSV Records Using Text::CSV* by ikegami (Patriarch) on Jul 20, 2009 at 07:35 UTC
According to the documentation, that won't work. Both the escape char and the quote char are limited to a single-byte character, and U+00FE, which serves as both here, is not a single-byte character in "perl-internal utf8". Boo for equating the string's internal encoding with whether it's been decoded or not. See Re: Decoding, Encoding string, how to? (internal encoding).
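The size of U+00FE in each encoding is easy to check. A small sketch, with variable names chosen only for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $quote = "\x{FE}";                     # U+00FE, the quote / escape character
my $utf8  = encode('UTF-8',    $quote);   # its UTF-8 byte sequence
my $utf16 = encode('UTF-16LE', $quote);   # its UTF-16LE byte sequence

printf "characters:     %d\n", length $quote;
printf "UTF-8 bytes:    %d (%s)\n", length $utf8,  unpack 'H*', $utf8;
printf "UTF-16LE bytes: %d (%s)\n", length $utf16, unpack 'H*', $utf16;
```

U+00FE is one character but two bytes in UTF-8, which is why it matters whether a module's one-character limit is counted in characters or in bytes.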
Re^3: Parsing UTF-16LE CSV Records Using Text::CSV* by graff (Chancellor) on Jul 20, 2009 at 13:52 UTC
"Boo for equating the string's internal encoding with whether it's been decoded or not."

If you "decode()" a non-ascii, non-utf8 string (or if it passes through a decoding IO layer on input), and the operation is successful, the string value returned by decode() has the utf8 flag on, and you get character semantics (not byte semantics) when doing stuff with that string -- that's the point of using "decode()" and the encoding IO layer, and that's all I was talking about in my suggestion. (My reply may well have been less than fully helpful for other reasons.)

As for U+00FE, perhaps I'm just behind the times, not having taken time to explore all the details of 5.10 yet. In 5.8.8, the "perl-internal utf8" storage of characters in the range 0x80-0xFF is single-byte. They would be converted to multi-byte on output to a utf8-mode file handle. I don't recall at the moment what particular operations are sensitive to (or would reveal) this distinction, but it's there.
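One way to see the distinction discussed here is to compare a string before and after `utf8::upgrade`. This is only a sketch for inspection (the variable names are illustrative); the `bytes` pragma is used solely to peek at the internal storage and is not something to rely on in real code:

```perl
use strict;
use warnings;

my $native   = "\xFE";        # one character, stored internally as one byte
my $upgraded = $native;
utf8::upgrade($upgraded);     # same character, stored internally as UTF-8

my $same           = $native eq $upgraded ? 1 : 0;
my $flag_native    = utf8::is_utf8($native)   ? 1 : 0;
my $flag_upgraded  = utf8::is_utf8($upgraded) ? 1 : 0;
my $chars_upgraded = length $upgraded;

# Peek at the internal byte counts.
my $bytes_native   = do { use bytes; length $native };
my $bytes_upgraded = do { use bytes; length $upgraded };

print "equal: $same, flags: $flag_native/$flag_upgraded, ",
      "chars: $chars_upgraded, bytes: $bytes_native/$bytes_upgraded\n";
```

The two strings compare equal and both have length 1 at the Perl level; only the flag and the internal byte count differ, and XS code that looks at the raw buffer can see that difference.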
Re^4: Parsing UTF-16LE CSV Records Using Text::CSV* by ikegami (Patriarch) on Jul 20, 2009 at 16:21 UTC
Re^5: Parsing UTF-16LE CSV Records Using Text::CSV* (5.10) by tye (Sage) on Jul 20, 2009 at 18:42 UTC
Some notes below your chosen depth have not been shown here
Re: Parsing UTF-16LE CSV Records Using Text::CSV* by Anonymous Monk on Jul 20, 2009 at 04:15 UTC
    #!/usr/bin/perl --
    use strict;
    use warnings;
    {
        use Text::CSV::Encoded;
        use autodie 2.06;
        use open ':encoding(UTF-16LE)';
        my $csv = Text::CSV::Encoded->new ({
            encoding_in  => "UTF-16LE",
            encoding_out => "UTF-16LE",
        });
        open (my $in,  '<', "input.csv");
        open (my $out, '>', "output.csv");    # open for writing, not reading
        while( my $columns = $csv->getline( $in ) ) {
            $csv->print( $out, $columns );
        }
        close $out;
    }