Oops, ikegami points out that the quote, separator, and escape sequences need to be single characters (single bytes?). I suspect they can each be a single UTF-8 character in a modern version of Text::CSV_XS, but I'm not going to verify that. If not, none of my approaches get you to the finish line.
Luckily, parsing a particular example of CSV isn't rocket surgery. (Sure, please use the module when the module works; this module in particular likely already knows some of the edge cases you didn't think of right off, and it has already been debugged.)
    # Assumes CSV was opened in raw (byte) mode, e.g. with '<:raw'
    my $w= '\s\0';      # Whitespace (single-quoted: in "..." \s is just 's')
    my $s= "\x14\0";    # Separator
    my $q= "\xFE\0";    # Quote / escape
    my $c= "..";        # Character
    # Versions we can apply modifiers to:
    my $W= "(?:$w)";
    my $S= "(?:$s)";
    my $Q= "(?:$q)";
    my $C= "(?:$c)";

    $/= "\r\0\n\0";
    while( <CSV> ) {
        my @row;
        my $end= 'none';
        while( m{
            \G                  # Don't skip any characters
            $W*(?!$w)           # Ignore whitespace outside quotes
            (?:                 # Either quoted value or bare value:
                $q(             # Opening quote, keep quoted value as $1
                    (?:
                        $q$q        # An escaped quote
                    |   (?!$q)$c    # Or a non-quote
                    )*          # Zero or more characters per value
                )$q(?!$q)       # Closing quote (not escaped)
            |   (               # Capture the bare value as $2
                    (?:
                        (?!$s|$q)$c # A non-separator, non-quote character
                    )+          # One or more 'bare' characters
                )(?<!$w)        # Don't backtrack
            |   ( )             # An empty field as $3
            )   # Separate "empty field" case required by prior (?<!$w)
            $W*
            ( $s | \z )         # Track field terminator in $4
        }xsg ) {
            $end= $4;
            my( $quoted, $bare, $empty )= ( $1, $2, $3 );
            if(  defined $empty  ) {
                push @row, undef;   # Or whatever you want 'empty' to mean
            } elsif(  defined $quoted  ) {
                $quoted =~ s/$q$q/$q/g;
                push @row, $quoted;
            } else {
                push @row, $bare;
            }
            # Stop at \z, else //g can match one more (spurious) empty field
            last   if  '' eq $end;
        }
        if(  '' ne $end  ) {    # We didn't reach \z
            warn "Skipping malformed line ($.).\n";
            next;
        }
        # Do something with @row, which has values as array of byte strings
    }
Note that I haven't debugged this (nor tested it). You'll have to do some work yourself. ;) But the approach seems both sound and robust, IMHO. If your particular CSV doesn't meet my expectations, then adjust the code to suit (since you bothered to mention both "quote" and "escape" characters, it sounds like your CSV will be well-formed and thus can be parsed by code very much like what I have provided).
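If you want to poke at the record-reading part without a real file, something like the following might do. It is only a sketch: the sample data is made up, and it uses an in-memory filehandle plus the core Encode module to build UTF-16LE bytes, then reads records with $/ set to CR LF as UTF-16LE code units.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Made-up sample: two CRLF-terminated records, fields separated by U+0014,
# encoded as UTF-16LE (no BOM).
my $data= encode( 'UTF-16LE', "a\x{14}b\r\nc\x{14}d\r\n" );

# Open the byte string as a file; ':raw' keeps Perl from touching the bytes.
open( my $fh, '<:raw', \$data )
    or  die "Can't open in-memory file: $!";

my @records;
{
    local $/= "\r\0\n\0";       # CR LF as UTF-16LE code units
    while(  my $line= <$fh>  ) {
        chomp $line;            # Removes the 4-byte record terminator
        push @records, $line;
    }
}
close $fh;

printf "%d records, first is %d bytes\n",
    scalar @records, length $records[0];
```

Each element of @records is still a raw byte string, which is what the field-matching regex expects.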
This approach might even have some advantages speed-wise over other approaches in that dealing with the variable-sized characters of UTF-8 makes for more complex (slower) code (of course, it isn't XS code either, but then, I value something working over speed by a long shot).
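Since the fields stay as raw UTF-16LE bytes until the very end, you would probably want to decode them once a row is complete. A minimal sketch, assuming the core Encode module; decode_row is a hypothetical helper name of my own, and the sample row is invented:

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helper: turn a row of UTF-16LE byte strings into Perl
# character strings, leaving undef (empty) fields alone.
sub decode_row {
    my @fields= @_;
    return map { defined($_) ? decode( 'UTF-16LE', $_ ) : undef } @fields;
}

# Invented sample row, as the parser would produce it (byte strings):
my @row= (
    encode( 'UTF-16LE', "caf\x{E9}" ),
    undef,
    encode( 'UTF-16LE', "\x{263A}" ),
);
my @text= decode_row( @row );
```

After this, @text holds character strings, so length and regexes on them work per character rather than per byte.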
Update: Tweaked the code slightly a few times. Then updated the regex to prevent some backtracking cases that would waste time on a malformed line, and added code to detect a malformed line.
- tye
In reply to Re^2: Parsing UTF-16LE CSV Records Using Text::CSV* (RYO)
by tye
in thread Parsing UTF-16LE CSV Records Using Text::CSV*
by Jim