SineSwiper has asked for the wisdom of the Perl Monks concerning the following question:

I've written a CSV reader that I think should handle every case of paired quotes, etc. It removes NULL and carriage-return characters, but other than that, it should be a completely compliant CSV reader. I just want to know if there's a case where this wouldn't work:
    &read_csv;

    sub read_csv {
       my @csv_lines = (
          'begin,middle,end'."\n",
          '"This is a ""test""",middle,end'."\n",
          '"""test"""", asd",middle,end'."\n",
          '""","""",""",","",",end'."\n",
          '"Multi-'."\n",
          'line",middle,end'."\n",
       );
       my $prev_line;

       foreach my $line (@csv_lines) {  # Normally, this would be a while(my $line = <CSV>) loop
          print "CSVL: $line";
          $line =~ s/[\0\r]//g;
          $line = $prev_line.$line;

          # Parse quoted stuff
          while ($line =~ /(?<!\")(\")((?:\"\"|[^\"]+)+?)(\"|\n)(?!\")/gcs) {
             my ($endpos, $starter, $token, $ender) = (pos($line), $1, $2, $3);
             my $oldlen   = length($starter.$token.$ender);
             my $startpos = $endpos - $oldlen;

             print "Token: $token ==> ";
             $token =~ s/\"\"/\0QUOTE\0/g;
             $token =~ s/,/\0COMMA\0/g;
             $token =~ s/\n/\0NEWLINE\0/g;
             $ender =~ s/\n/\0NEWLINE\0/g;
             print "$token\n";

             substr($line, $startpos, $oldlen, ($starter.$token.$ender));
             pos($line) = $endpos + (length($starter.$token.$ender) - $oldlen);
             print "Line: $line\n";
          }

          # Runaway multi-line; grab the next line and try it again
          if ($line =~ /\0NEWLINE\0$/) {
             $prev_line = $line;
             next;
          }
          undef $prev_line;
          $line =~ s/[\"\n]//g;  # Remove non-data

          # Re-translate masked characters
          my @values = split(/,/, $line);
          foreach $value (@values) {
             $value =~ s/\0QUOTE\0/\"/g;
             $value =~ s/\0COMMA\0/\,/g;
             #$value =~ s/\0NEWLINE\0/\n/g;
             $value =~ s/\0NEWLINE\0/\\n/g;  # for testing, use literal '\n'
          }
          print "Values: ".join(' | ', @values)."\n";
       }
    }

Replies are listed 'Best First'.
Re: Successful CSV reader?
by Limbic~Region (Chancellor) on Apr 26, 2005 at 12:33 UTC
    SineSwiper,
    To verify your code gets passing grades, you should use the test suites of modules that perform similar tasks: Text::xSV, Text::CSV, and Text::CSV_XS. I believe the first one, written by tilly, works in cases where the second two break, so you may want to consider using it even if your code passes all the test suites, as it is a bit more mature.

    Cheers - L~R

      works in cases where the second two break

      Text::xSV is an excellent module and handles more than Text::CSV, but I wish you and dragonchild would quit spreading FUD about Text::CSV_XS. CSV_XS has a much uglier interface than xSV, and xSV does things not related to the actual parsing that CSV_XS does not do (and likewise Text::CSV_XS does non-parsing things that xSV does not). But when it comes to actually parsing CSV, the two modules are equivalent in what they can handle, except for a few very unusual edge cases that CSV_XS handles and xSV does not. If you can find some text that works with xSV and does not work with CSV_XS, post it. Otherwise, refer to Benchmark comparison of Text::xSV and Text::CSV_XS and Comparison of the parsing features of CSV (and xSV) modules. If you like Text::xSV, great, recommend it to everyone on the planet (I often do), but there's no need to denigrate other modules because you've found one you like.

        There is actually one edge parsing case that Text::xSV handles differently than Text::CSV_XS where I think that Text::xSV's behaviour is more convenient. (There being no spec, it is impossible to say which is right.) In Text::xSV if you have ,, the embedded field is going to be undef, while if you have ,"", the embedded field is an empty string. (This distinction is drawn on writing as well.) Both of those come out as an empty string in Text::CSV_XS.

        This behaviour was deliberately chosen to match how Microsoft tools export/import data. Access will turn ,, into a NULL and ,"", into an empty string, and will export them that way as well. Or at least that was how it worked when I last used Access. I don't think that you can draw this distinction in Text::CSV_XS.

        Other than that, I believe that the two should have identical capabilities with default settings. But Text::CSV_XS has increased flexibility for changing how it handles parsing (eg changing quote character, delimiter, etc), and Text::xSV has increased flexibility for changing how you can access/manipulate data on the fly (aliases, computed fields).
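The undef versus empty-string distinction described above can be illustrated with a minimal pure-Perl sketch. This is not Text::xSV's implementation; `parse_line_with_nulls` is a hypothetical helper written only for illustration, and it assumes a single physical line with no embedded newlines.

```perl
use strict;
use warnings;

# Hypothetical helper (not from Text::xSV): parse one CSV line, returning
# undef for a bare empty field (,,) and '' for a quoted empty field (,"",).
sub parse_line_with_nulls {
    my ($line) = @_;
    chomp $line;
    my @fields;
    pos($line) = 0;
    while (1) {
        if ($line =~ /\G"((?:[^"]|"")*)"(?=,|\z)/gc) {
            # Quoted field: un-double embedded quotes; may be the empty string.
            (my $value = $1) =~ s/""/"/g;
            push @fields, $value;
        }
        elsif ($line =~ /\G([^",]*)(?=,|\z)/gc) {
            # Unquoted field: an empty one is treated as NULL (undef).
            push @fields, length($1) ? $1 : undef;
        }
        else {
            die "Malformed CSV near offset " . pos($line) . "\n";
        }
        # Stop when no separator follows the field just read.
        last unless $line =~ /\G,/gc;
    }
    return @fields;
}

my @fields = parse_line_with_nulls(qq{a,,"",b\n});
print join(' | ', map { defined $_ ? "'$_'" : 'undef' } @fields), "\n";
```

Here the second field comes back as undef and the third as a defined empty string, which is the behaviour attributed to Text::xSV (and to Access) in the reply above.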

        jZed,
        ...but I wish you and dragonchild would quit spreading FUD about Text::CSV_XS.

        I take offense to that. I only said I believed Text::xSV worked in cases where the other two did not. The focus was on the use of the test suites anyway.

        And what FUD exactly? Besides this thread, I have mentioned Text::xSV two other times and neither was condemning the other modules.

        This post has been changed to reflect real information and not perceived information. I recognize jZed's point about spreading rumors, which is why I didn't state my beliefs as being facts.

        Cheers - L~R

Re: Successful CSV reader?
by salva (Canon) on Apr 26, 2005 at 09:16 UTC
    You can use Text::CSV_XS from CPAN to read CSV files. I have used it in several projects and it works fine!
      The problem I've found with that module is that there's no code to handle multi-line strings. It's designed around parsing one line at a time, but what if that line is part of a multi-line field? I didn't find a module that handled it, so I wrote my own.
        Text::CSV_XS has a binary mode that allows for binary data (including newline chars) inside quotes. Have you tried it?
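A minimal sketch of the binary-mode approach, assuming a reasonably current Text::CSV_XS that provides `getline` (the interface available in 2005 may have differed). The sample data reuses the multi-line record from the question, and the in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Sample data: the first record spans two physical lines.
my $data = qq{"Multi-\nline",middle,end\nbegin,middle,end\n};

# binary => 1 permits embedded newlines (and other binary data) in quoted fields.
my $csv = Text::CSV_XS->new({ binary => 1 })
    or die "Cannot create parser: " . Text::CSV_XS->error_diag();

# Read from an in-memory filehandle; a real script would open a file instead.
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

# getline reads as many physical lines as one record needs.
my @rows;
while (my $row = $csv->getline($fh)) {
    push @rows, $row;
}
close $fh;

# Show embedded newlines as a literal \n, as the original test code does.
for my $row (@rows) {
    print join(' | ', map { (my $v = $_) =~ s/\n/\\n/g; $v } @$row), "\n";
}
```

The point of the sketch is that the parser, not the caller, decides where a record ends, so the caller never has to stitch physical lines together.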
Re: Successful CSV reader?
by dragonchild (Archbishop) on Apr 26, 2005 at 12:59 UTC
    To further Limbic~Region's reply, you should be using Text::xSV because it handles more than just basic reading, does it modularly, and will most likely run faster and more portably than yours. Oh - it also handles writing, too. :-)

    The Perfect is the Enemy of the Good.

      Text::xSV seems to handle the multi-line stuff, which Text::CSV doesn't do because of its per-line format. Thanks.