cavac has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I have to parse a text file (comma-separated fields), more or less a standard CSV file. The exception here is that one of the fields can span multiple lines.

I know this is a bad design, but can't change it since it's from an old data export.

The file looks somewhat like this (simplified to highlight the problem):

"1", "title1", "hello world", "foo"
"2", "title2", "hallo welt", "bar"
"3", "title3", "this is a very long line", "baz"

To add to the problem, the file can be a few gigs in size and has quoted characters as well.

I know I'm capable of writing a parser on my own, using a simple state machine (been there, done that, got the headache). But I'd rather use a tried and tested method than spend the next ten days hunting for obscure bugs.

Can you recommend a module that works on this specific file format? The goal here is to extract the data fields record-by-record and put them into a database.

Don't use '#ff0000':
use Acme::AutoColor; my $redcolor = RED();
All colors subject to change without notice.

Re: Parsing CSV with multiline fields
by Tux (Canon) on Sep 02, 2011 at 14:47 UTC

    Not bad design at all, and something Text::CSV can deal with perfectly. If the files are indeed that big, consider also installing Text::CSV_XS, which is up to 100 times faster than the bundled Text::CSV_PP.

    Extra note, if the format is indeed as you've posted, you'd probably also have to look at the allow_whitespace attribute.


    Enjoy, Have FUN! H.Merijn
Re: Parsing CSV with multiline fields
by Ratazong (Monsignor) on Sep 02, 2011 at 14:48 UTC
    Have you checked Text::CSV? It claims to handle newlines if the mode is set to binary.
      It claims to handle newlines ...

      In particular, see the discussion of <> versus the getline() method in the Embedded newlines section of Text::CSV.
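
For illustration, the advice in the replies above could be combined roughly as follows. This is only a minimal sketch, not code from the thread: the sample data (including where the embedded newline falls) is made up, and an in-memory filehandle stands in for the real multi-gigabyte export. It assumes Text::CSV is installed and uses its binary, allow_whitespace and auto_diag attributes together with getline().

```perl
use strict;
use warnings;
use Text::CSV;

# binary => 1 permits newlines inside quoted fields;
# allow_whitespace => 1 tolerates the blanks after each comma.
my $csv = Text::CSV->new({
    binary           => 1,
    allow_whitespace => 1,
    auto_diag        => 1,
});

# Hypothetical sample: the second record has a field spanning two lines.
my $data = qq{"1", "title1", "hello world", "foo"\n}
         . qq{"2", "title2", "hallo\nwelt", "bar"\n};

open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @rows;
# getline() keeps reading physical lines until the record is complete,
# so only one record is held in memory at a time.
while (my $row = $csv->getline($fh)) {
    push @rows, $row;    # a real import would insert into the database here
}
close $fh;

my @parts = split /\n/, $rows[1][2];
printf "%d records; field 3 of record 2 spans %d lines\n",
    scalar @rows, scalar @parts;
```

Reading with <$fh> and handing each line to parse() would split such a record in the middle of the quoted field, which is why the sketch lets getline() pull the lines from the handle itself.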

      Binary mode!!!1!. Gaaaah, totally overlooked that one. *facepalm*

      Thanks, you just saved my weekend!
