Oops, ikegami points out that the quote, separator, and escape sequences need to be single characters (single bytes?). I suspect they can each be a single UTF-8 character in a modern version of Text::CSV_XS, but I'm not going to verify that. If not, none of my approaches get you to the finish line.
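To make the single-byte problem concrete, here is a minimal sketch using the core Encode module. It assumes the separator is U+0014 and the quote is U+00FE (the values the code further down uses); `utf16le_hex` is a hypothetical helper name of my own, not anything from Text::CSV_XS.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: hex dump of the UTF-16LE encoding of a string.
sub utf16le_hex {
    my( $chars ) = @_;
    my $bytes = encode( 'UTF-16LE', $chars );
    return join ' ', map { sprintf '%02X', ord $_ } split //, $bytes;
}

# Assumed separator (U+0014) and quote (U+00FE), plus an ordinary comma.
my @chars = ( "\x{14}", "\x{FE}", ',' );
for my $char ( @chars ) {
    printf "U+%04X -> %s\n", ord $char, utf16le_hex( $char );
}
```

Each of these characters comes out as two bytes, so a parser that insists on one-byte separator and quote values can't be pointed at the raw UTF-16LE bytes.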

Luckily, parsing a particular example of CSV isn't rocket surgery. (Sure, please use the module when the module works. This module in particular likely already knows some of the edge cases you didn't think of right off, and it has already been debugged.)

my $w= '\s\0';      # Whitespace (single quotes so \s reaches the regex)
my $s= "\x14\0";    # Separator
my $q= "\xFE\0";    # Quote / escape
my $c= "..";        # Character
# Versions we can apply modifiers to:
my $W= "(?:$w)";
my $S= "(?:$s)";
my $Q= "(?:$q)";
my $C= "(?:$c)";

$/= "\r\0\n\0";
while(  <CSV>  ) {
    my @row;
    my $end= 'none';
    while(
        m{
            \G                      # Don't skip any characters
            $W*(?!$w)               # Ignore whitespace outside quotes
            (?:                     # Either quoted value or bare value:
                $q(                 # Opening quote, keep quoted value as $1
                    (?:
                        $q$q        # An escaped quote
                      | (?!$q)$c    # Or a non-quote
                    )*              # Zero or more characters per value
                )$q(?!$q)           # Closing quote (not escaped)
              | (                   # Capture the bare value as $2
                    (?:
                        (?!$s|$q)$c # A non-separator, non-quote character
                    )+              # One or more 'bare' characters
                )(?<!$w)            # Bare value can't end in whitespace
              | ( )                 # An empty field as $3
            )   # Separate "empty field" case required by prior (?<!$w)
            $W*
            ( $s | \z )             # Track field terminator in $4
        }xsg
    ) {
        $end= $4;
        my( $quoted, $bare, $empty )= ( $1, $2, $3 );
        if(  defined $empty  ) {
            push @row, undef;   # Or whatever you want 'empty' to mean
        } elsif(  defined $quoted  ) {
            $quoted =~ s/$q$q/$q/g;
            push @row, $quoted;
        } else {
            push @row, $bare;
        }
        last if '' eq $end;     # Hit \z; don't match a phantom empty field
    }
    if(  '' ne $end  ) {    # We didn't reach \z
        warn "Skipping malformed line ($.).\n";
        next;
    }
    # Do something with @row, which has values as array of byte strings
}
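The loop above reads from a CSV handle that has to deliver raw bytes, and it leaves each field as a UTF-16LE byte string. Here is a rough sketch of both ends of that, under my own assumptions: an in-memory file stands in for the real one, and `decode_field` is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Stand-in for the real file: two CRLF-terminated UTF-16LE records.
my $data = encode( 'UTF-16LE', "one\x{14}two\r\nthree\x{14}four\r\n" );

open my $csv, '<', \$data
    or die "Can't open in-memory file: $!";
binmode $csv;                   # No newline translation, no decoding

my @records;
{
    local $/ = "\r\0\n\0";      # CRLF as UTF-16LE bytes
    @records = <$csv>;
}
close $csv;

# Hypothetical helper: decode one byte-string field, passing undef
# (an "empty" field) through untouched.
sub decode_field {
    my( $bytes ) = @_;
    return undef if ! defined $bytes;
    return decode( 'UTF-16LE', $bytes );
}

printf "%d records read\n", scalar @records;
print decode_field( "t\0w\0o\0" ), "\n";
```

With a real file you would open it by name and call binmode the same way. Decoding after the split, rather than before, keeps the regex working on fixed two-byte units.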

Note that I haven't debugged this (nor tested it). You'll have to do some work yourself. ;) But the approach seems both sound and robust, IMHO. If your particular CSV doesn't meet my expectations, then adjust the code to suit. Since you bothered to mention both "quote" and "escape" characters, it sounds like your CSV will be well-formed and thus can be parsed by code very much like what I have provided.
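One way to start on that debugging is to wrap the same regex in a sub and feed it an in-memory record. The sketch below does that, with three precautions: the whitespace pattern is single-quoted, all three captures are assigned, and the loop stops once \z has been reached. `parse_record` and the sample data are my own inventions for illustration; the separator and quote values are the ones assumed above.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

my $w = '\s\0';     # Whitespace, as regex source (hence single quotes)
my $s = "\x14\0";   # Separator
my $q = "\xFE\0";   # Quote / escape
my $c = '..';       # One 16-bit code unit
my $W = "(?:$w)";

# Hypothetical helper: split one UTF-16LE record into byte-string fields.
# Returns an array ref, or undef for a malformed record.
sub parse_record {
    my( $rec ) = @_;
    my @row;
    my $end = 'none';
    while(
        $rec =~ m{
            \G $W*(?!$w)
            (?:
                $q( (?: $q$q | (?!$q)$c )* )$q(?!$q)    # Quoted, capture 1
              | ( (?: (?!$s|$q)$c )+ )(?<!$w)           # Bare, capture 2
              | ( )                                     # Empty, capture 3
            )
            $W* ( $s | \z )                             # Terminator, capture 4
        }xsg
    ) {
        $end = $4;
        my( $quoted, $bare, $empty ) = ( $1, $2, $3 );
        if( defined $empty ) {
            push @row, undef;
        } elsif( defined $quoted ) {
            $quoted =~ s/$q$q/$q/g;
            push @row, $quoted;
        } else {
            push @row, $bare;
        }
        last if '' eq $end;     # Reached \z; skip the zero-length rematch
    }
    return '' eq $end ? \@row : undef;
}

# Sample record: bare, quoted (with a doubled quote), empty, bare.
my $line = join( "\x{14}", 'a b', " \x{FE}q\x{FE}\x{FE}r\x{FE} ", '', 'z' )
         . "\r\n";
my $row  = parse_record( encode( 'UTF-16LE', $line ) );
print scalar( @$row ), " fields\n";
```

An unterminated quote makes the very first match fail, so `parse_record` returns undef for it. That is the same condition the main loop reports as a malformed line.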

This approach might even have some speed advantages over other approaches, since dealing with the variable-sized characters of UTF-8 makes for more complex (slower) code. (Of course, it isn't XS code either, but then, I value something working over speed by a long shot.)

Updated the code slightly a few times. Then updated the regex to prevent some backtracking cases that would waste time on a malformed line, and added code to detect a malformed line.

- tye

In reply to Re^2: Parsing UTF-16LE CSV Records Using Text::CSV* (RYO) by tye
in thread Parsing UTF-16LE CSV Records Using Text::CSV* by Jim
