csv has asked for the wisdom of the Perl Monks concerning the following question:

I have been looking for a Perl module that can read CSV-formatted data containing embedded newlines from a file. I haven't had any luck; however, this is not a new issue, and I'm sure someone has solved the problem and packaged it.

Is anyone aware of a Perl CSV module or package that supports fields with embedded newlines? An example record would be:

  123,"This is a field
  with embedded newlines
  in it.",city,state,11101

Replies are listed 'Best First'.
Re: CSV with embedded newlines (Text::xSV)
by tye (Sage) on Apr 03, 2003 at 17:10 UTC

    Text::xSV was written to address this issue.

                    - tye
Re: CSV with embedded newlines
by Mr. Muskrat (Canon) on Apr 03, 2003 at 16:37 UTC
Re: CSV with embedded newlines
by kal (Hermit) on Apr 04, 2003 at 08:24 UTC

    I'm pretty sure that Tie::CSV gets this wrong, unfortunately - I've tried to use it in the past, and never had much success. However, Text::CSV has worked for me with this type of file (well, Text::CSV_XS, which should be the same thing ;)

    The basic loop structure that I have is something like:

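    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new({ 'binary' => 1 });
    my $current_line = '';
    while (<CVSFILE>) {    # CVSFILE is assumed to be already open
        $current_line .= $_;
        # keep appending lines until they parse as one complete record
        next unless $csv->parse($current_line);
        my @row = $csv->fields();
        $current_line = '';
        # do stuff with the rows...
    }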

    This seemed to hold up against some really bizarre files (very _very_ big files, for example), so it looks pretty good. Strangely, I think Tie::CSV uses the Text::CSV module to read lines in, but I don't think it gets the loop right (like above). However, this might be my memory playing tricks.
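
For comparison, here is a minimal sketch of the same idea that lets the module do the line accumulation itself. It assumes a reasonably recent Text::CSV_XS that provides getline(); the sample data is the record from the question, and the in-memory filehandle is only a stand-in for a real file.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Sample data taken from the question; the quoted field spans three lines.
my $data = qq{123,"This is a field\nwith embedded newlines\nin it.",city,state,11101\n};

# binary => 1 is what allows newlines inside quoted fields.
my $csv = Text::CSV_XS->new({ binary => 1 })
    or die "Cannot create parser: " . Text::CSV_XS->error_diag();

# An in-memory filehandle stands in for a real file here.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @rows;
while (my $row = $csv->getline($fh)) {
    push @rows, $row;    # $row is an array reference of fields
}
close $fh;

printf "%d record(s), %d fields in the first\n",
    scalar @rows, scalar @{ $rows[0] };
```

With getline() the parser reads as many physical lines as it needs to complete a record, so no manual buffer is required.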

      I'm pretty sure that kal is right here, the key feature being the use of the "binary" option. One of the many silly things about CSV_XS is that you *always* want to use the "binary" option (this is reminiscent of FTP in the old days, before they got a clue and made "binary" the default). Pretty much all real text is "binary" from the point of view of CSV_XS (e.g. if you want to use any iso8859 extended characters).

      But the last time I looked you definitely wanted to use "CSV_XS", not the older "CSV" module. They're really not the same.

Re: CSV with embedded newlines
by mattr (Curate) on Apr 06, 2003 at 09:48 UTC
    I was recently silly enough to get myself talked into using CSV for storage. One bug I found was simply that admins were typing a multiline comment into a field, which was caught neither by the DBI wrapper nor by the binary setting (CSV_XS). So I just escaped it with a pipe sigil like "|CR" (also escaping standalone pipe characters, of course) and restored the carriage return later. You should also note that in Excel you can somehow type a vertical tab, which will break CSV too. My recommendation: don't use CSV for storage; just generate it with CSV_XS or Text::CSV when you need to output it. I used a download button that would let you open it in Excel. In general these problems do come up, but as far as I know, CSV's basic premise is one line per record.

    Personally I would not trust that binary setting; it didn't work for me, and I got corrupted data when using 8-bit (Japanese SJIS) data. Constrain/pack your data into the ASCII range mentioned in one of the CSV module docs, or sweat it.
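
A rough sketch of the pipe-escaping scheme described above, using core Perl only. The helper names escape_field and unescape_field are hypothetical, and the choice of "||" for a literal pipe is an assumption; the post only says that standalone pipes were escaped somehow.

```perl
use strict;
use warnings;

# Hypothetical helpers: "||" stands for a literal pipe, "|CR" for a newline.
sub escape_field {
    my ($text) = @_;
    $text =~ s/\|/||/g;        # protect literal pipes first
    $text =~ s/\r?\n/|CR/g;    # then encode line breaks (CRLF and LF alike)
    return $text;
}

sub unescape_field {
    my ($text) = @_;
    # Scanning left to right means "||CR" decodes to a pipe followed by "CR".
    $text =~ s/\|(\||CR)/$1 eq '|' ? '|' : "\n"/ge;
    return $text;
}

my $comment = "first line\nsecond | line";
my $stored  = escape_field($comment);
print "$stored\n";
print unescape_field($stored) eq $comment ? "round trip ok\n" : "mismatch\n";
```

Note that this sketch restores every line break as LF, so a CRLF in the original comment does not survive the round trip unchanged.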

Re: CSV with embedded newlines
by spurperl (Priest) on Apr 03, 2003 at 17:04 UTC
    You are not defining the problem precisely. What spec do you want the code/module to comply with?

    What counts as a new record (a new line of CSV), as opposed to an embedded newline, for your needs?
    In the example you gave, how would you tell that it's all just one record? Because of the quotes? By counting fields? If so, how would you handle erroneous input?
    Give these points a thought - it may help you think of a solution and help us help you.
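
One possible answer to the "because of the quotes?" question, as a sketch only: treat a record as complete once the text read so far contains an even number of double quotes. This assumes quotes inside a quoted field are doubled (""), which keeps the count even; the helper name group_records is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: group physical lines into logical CSV records.
# Assumption: quotes inside a quoted field are doubled (""), so a complete
# record always contains an even number of double-quote characters.
sub group_records {
    my (@lines) = @_;
    my (@records, $buffer);
    for my $line (@lines) {
        $buffer = defined $buffer ? $buffer . $line : $line;
        my $quotes = ($buffer =~ tr/"//);    # count quotes seen so far
        next if $quotes % 2;                 # odd: still inside a quoted field
        push @records, $buffer;
        undef $buffer;
    }
    # Leftover text means an unterminated quote: report it rather than guess.
    warn "unterminated quoted field at end of input\n" if defined $buffer;
    return @records;
}

my @lines = (
    qq{123,"This is a field\n},
    qq{with embedded newlines\n},
    qq{in it.",city,state,11101\n},
    qq{456,plain,city,state,22202\n},
);
my @records = group_records(@lines);
print scalar(@records), " records\n";
```

Erroneous input (a stray unbalanced quote) makes everything after it look like one long record, which is why the sketch warns instead of silently emitting the leftover.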
Re: CSV with embedded newlines
by Super Monkey (Beadle) on Apr 03, 2003 at 17:38 UTC
    You could always clean the .csv file before you parse it: use a regex to remove any unwanted newlines, etc.
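
A sketch of what such a cleanup pass might look like, assuming the whole file fits in memory, quotes are balanced, and replacing each embedded line break with a space is acceptable. The helper name flatten_quoted_newlines is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: replace line breaks inside double-quoted fields with
# a space. A quoted field is matched as: quote, then any run of non-quote
# characters or doubled quotes, then the closing quote.
sub flatten_quoted_newlines {
    my ($text) = @_;
    $text =~ s{("(?:[^"]|"")*")}{
        my $field = $1;          # copy before the inner substitution runs
        $field =~ s/\r?\n/ /g;   # embedded line break becomes a space
        $field;
    }ge;
    return $text;
}

my $raw = qq{123,"This is a field\nwith embedded newlines\nin it.",city,state,11101\n};
print flatten_quoted_newlines($raw);
```

Line breaks outside quotes are left alone, so record boundaries survive; the cost is that the original line breaks inside the field are lost.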
Re: CSV with embedded newlines
by aquarium (Curate) on Apr 04, 2003 at 10:48 UTC
    Set your record separator accordingly: LF or CRLF. Chris
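
A sketch of the record-separator idea, which only helps under a specific assumption: records end in CRLF while embedded line breaks are bare LF (some spreadsheet exports look like this, but that is not guaranteed). The in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;

# Assumption: records end in CRLF, embedded line breaks are bare LF.
my $data = qq{123,"This is a field\nwith embedded newlines\nin it.",city,state,11101\r\n}
         . qq{456,plain,city,state,22202\r\n};

# With a real file on Windows, binmode the handle so CRLF is not translated.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = "\r\n";          # read up to CRLF instead of LF
    while (my $record = <$fh>) {
        chomp $record;          # chomp removes the current value of $/
        push @records, $record;
    }
}
close $fh;

print scalar(@records), " records\n";
```

If the embedded newlines use the same terminator as the records, changing $/ cannot tell them apart and a quote-aware parser is still needed.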