Re: Parsing UTF-16LE CSV Records Using Text::CSV* (TAMWTDI)

One first step would be to convert from UTF-16LE into a Perl string of characters (which internally means UTF-8, since Latin-1 may not cover all of the characters you need). One way to do that is the Encode module. Another is to use an input layer (see perlunicode for information on converting character encodings with a layer, plus lots of other material about Unicode in Perl).
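For the record, here is a minimal sketch of those first two ways. The sample data is made up, and an in-memory file handle stands in for the real file, so treat the names as placeholders rather than anything from the original question.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Made-up sample: a BOM followed by two UTF-16LE records.
my $bytes = "\xFF\xFE" . encode( 'UTF-16LE', "a,b\r\nc,d\r\n" );

# Way 1: decode a byte string with the Encode module.
# Plain 'UTF-16' (no LE/BE suffix) uses the BOM to pick the byte order
# and strips it from the result.
my $text = decode( 'UTF-16', $bytes );

# Way 2: let an input layer decode while reading.
# \$bytes opens an in-memory handle; a real script would pass a file name.
open( my $fh, '<:raw:encoding(UTF-16LE)', \$bytes ) or die "open: $!";
my @lines = <$fh>;
close $fh;

# 'UTF-16LE' does not strip the BOM, so it arrives as the character U+FEFF.
$lines[0] =~ s/^\x{FEFF}//;

print "Encode gave ", length( $text ), " characters\n";
print "Layer gave ", scalar( @lines ), " lines\n";
```

Either way you end up with character strings, so the usual "\n" record separator and ordinary regexes work on them.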

A third way to do that would be something you likely won't find documentation on, so I'll roll the code for you:

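    binmode( CSV );                     # Raw bytes, no layers
    read( CSV, my $bom, 2 );            # Skip the 2-byte BOM
    $/= pack "v", unpack "c", "\n";     # Record separator "\n\0"
    while(  <CSV>  ) {
        # Each 16-bit little-endian unit becomes one character
        # (surrogate pairs above U+FFFF are not combined)
        $_= pack "U*", unpack "v*", $_;
        # Parse the data, which is now a character string (UTF-8 inside)
    }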

But you should probably use a real Unicode-conversion solution like one of the prior two (especially because each new version of Perl changes things related to Unicode, usually changing something so that it now sometimes silently does something different than it used to).
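To make that caveat concrete, here is a small sketch comparing the pack/unpack trick with Encode on made-up sample strings. For characters in the Basic Multilingual Plane the two agree. Above U+FFFF, UTF-16 uses surrogate pairs, which `unpack "v*"` sees as two separate units.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The record separator used in the loop above: "\n" as one UTF-16LE unit.
my $sep = pack "v", unpack "c", "\n";
print $sep eq "\n\0" ? "separator is LF NUL\n" : "unexpected separator\n";

# Made-up sample line containing only BMP characters.
my $bytes = encode( 'UTF-16LE', "caf\x{E9} \x{263A}\r\n" );

# unpack "v*" yields 16-bit code units; pack "U*" makes characters of them.
my $via_pack   = pack "U*", unpack "v*", $bytes;
my $via_encode = decode( 'UTF-16LE', $bytes );
print $via_pack eq $via_encode ? "same result\n" : "different result\n";

# A code point above U+FFFF is stored as a surrogate pair in UTF-16,
# so "v*" reports two units where Encode reports one character.
my $astral = encode( 'UTF-16LE', "\x{1F600}" );
my @units  = unpack "v*", $astral;
printf "%d code units, %d character(s) via Encode\n",
    scalar( @units ), length( decode( 'UTF-16LE', $astral ) );
```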

A different first step would be to solve everything in terms of Perl byte strings. That is probably quite straightforward. Just tell Text::CSV to use "binary" mode and give it the separators and such that you already figured out as Perl strings of bytes (it will likely use Text::CSV_XS under the covers).
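In case "Perl strings of bytes" is unclear, here is a sketch of what the delimiters look like at the byte level. The specific characters (0x14 as separator, 0xFE as quote, CRLF as line ending) are assumptions for illustration; substitute whatever your file actually uses.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Assumed delimiters, encoded the same way the file is encoded.
my $sep_bytes   = encode( 'UTF-16LE', "\x14" );      # Two bytes: 14 00
my $quote_bytes = encode( 'UTF-16LE', "\x{FE}" );    # Two bytes: FE 00
my $eol_bytes   = encode( 'UTF-16LE', "\r\n" );      # Four bytes: 0D 00 0A 00

# Show each one as hex so the layout is visible.
printf "%-5s %s\n", $_->[0], unpack( "H*", $_->[1] )
    for [ sep => $sep_bytes ], [ quote => $quote_bytes ], [ eol => $eol_bytes ];
```

Note that each delimiter is more than one byte long, which matters if the parser insists on single-byte delimiters.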

- tye

Replies are listed 'Best First'.

Re^2: Parsing UTF-16LE CSV Records Using Text::CSV* (RYO)
by tye (Sage) on Jul 20, 2009 at 16:00 UTC

Oops, ikegami points out that the quote, separator, and escape sequences need to be single characters (single bytes?). I suspect they can each be a single UTF-8 character in a modern version of Text::CSV_XS, but I'm not going to verify that. If not, none of my approaches get you to the finish line.

Luckily, parsing a particular example of CSV isn't rocket surgery (sure, please use the module when the module works, as, particularly with this module, it likely already knows some of the edge cases you didn't think of right off and has already been debugged).

my $w= '\s\0';      # Whitespace (regex source: \s then a NUL byte)
my $s= "\x14\0";    # Separator
my $q= "\xFE\0";    # Quote / escape
my $c= "..";        # Character (any two bytes, given /s)
# Versions we can apply modifiers to:
my $W= "(?:$w)";
my $S= "(?:$s)";
my $Q= "(?:$q)";
my $C= "(?:$c)";

# Assumes CSV is in binmode() and the 2-byte BOM has already been skipped
$/= "\r\0\n\0";
while(  <CSV>  ) {
    my @row;
    my $end= 'none';
    while(
        m{
            \G                      # Don't skip any characters
            $W*(?!$w)               # Ignore whitespace outside quotes
            (?:                     # Either quoted value or bare value:
                $q(                 # Opening quote, keep quoted value as $1
                    (?:
                        $q$q        # An escaped quote
                      | (?!$q)$c    # Or a non-quote
                    )*              # Zero or more characters per value
                )$q(?!$q)           # Closing quote (not escaped)
              | (                   # Capture the bare value as $2
                    (?:
                        (?!$s|$q)$c # A non-separator, non-quote character
                    )+              # One or more 'bare' characters
                )(?<!$w)            # Don't backtrack
              | ( )                 # An empty field as $3
            )   # Separate "empty field" case required by prior (?<!$w)
            $W*
            ( $s | \z )             # Track field terminator in $4
        }xsg
    ) {
        $end= $4;
        my( $quoted, $bare, $empty )= ( $1, $2, $3 );
        if(  defined $empty  ) {
            push @row, undef;   # Or whatever you want 'empty' to mean
        } elsif(  defined $quoted  ) {
            $quoted =~ s/$q$q/$q/g;
            push @row, $quoted;
        } else {
            push @row, $bare;
        }
        last if  '' eq $end;    # Reached \z; avoid an extra empty match there
    }
    if(  '' ne $end  ) {    # We didn't reach \z
        warn "Skipping malformed line ($.).\n";
        next;
    }
    # Do something with @row, which has values as array of byte strings
}

Note that I haven't debugged this (nor tested it). You'll have to do some work yourself. ;) But the approach seems both sound and robust, IMHO. If your particular CSV doesn't meet my expectations, then adjust the code to suit (since you bothered to mention both "quote" and "escape" characters, it sounds like your CSV will be well-formed and thus can be parsed by code very much like what I have provided).
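One way to do that work without touching the real file is to wrap the regex in a sub and feed it a hand-made record. The sketch below does that. The helper name `parse_record` and the sample record are made up for illustration, the whitespace pattern is single-quoted so that `\s` reaches the regex engine, and the delimiters are assumed to be 0x14 and 0xFE.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $w = '\s\0';      # Regex source: one whitespace character in UTF-16LE
my $s = "\x14\0";    # Separator bytes (assumed)
my $q = "\xFE\0";    # Quote bytes (assumed)
my $c = '..';        # Any one 2-byte character (needs /s)
my $W = "(?:$w)";

# Hypothetical helper: split one record (a byte string) into fields.
# Returns an empty list if the record is malformed.
sub parse_record {
    my( $rec ) = @_;
    my( @row, $end );
    while( $rec =~ m{
            \G $W*(?!$w)
            (?:
                $q( (?: $q$q | (?!$q)$c )* )$q(?!$q)    # Quoted value, $1
              | ( (?: (?!$s|$q)$c )+ )(?<!$w)           # Bare value, $2
              | ( )                                     # Empty field, $3
            )
            $W* ( $s | \z )                             # Terminator, $4
        }xsg
    ) {
        $end = $4;
        my( $quoted, $bare, $empty ) = ( $1, $2, $3 );
        if( defined $empty ) {
            push @row, undef;
        } elsif( defined $quoted ) {
            $quoted =~ s/$q$q/$q/g;     # Unescape doubled quotes
            push @row, $quoted;
        } else {
            push @row, $bare;
        }
        last if $end eq '';             # Reached \z
    }
    return if !defined $end || $end ne '';
    return @row;
}

# Made-up record: quoted, padded bare, empty, and quoted-with-escapes fields.
my $text  = "\x{FE}alpha\x{FE}\x14 beta \x14\x14"
          . "\x{FE}say \x{FE}\x{FE}hi\x{FE}\x{FE}\x{FE}\r\n";
my @bytes = parse_record( encode( 'UTF-16LE', $text ) );

# Fields come back as UTF-16LE byte strings; decode them for display.
my @row = map { defined $_ ? decode( 'UTF-16LE', $_ ) : undef } @bytes;
print scalar( @row ), " fields\n";
```

Feeding it a few deliberately broken records (an unclosed quote, a quote in the middle of a bare value) is a quick way to see whether the malformed-line detection behaves the way you want.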

This approach might even have some advantages speed-wise over other approaches in that dealing with the variable-sized characters of UTF-8 makes for more complex (slower) code (of course, it isn't XS code either, but then, I value something working over speed by a long shot).

Updated code slightly a few times. Then updated the regex to prevent some backtrack cases that would be a waste of time in the case of a malformed line and code to detect a malformed line.

- tye
