xorl has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I've been given a bunch of comma seperated value files. I need to extract the data and stuff it into a MySQL database.

That's simple, but I have a problem. My problem is that one of the notes fields contains some unescaped quotation marks, new line characters, and commas. This of course means I can't just assume each line is a new record and that every comma is really the field divider.

Before I go and spend countless hours trying to figure out this myself, I thought someone else must have had this problem and already has a solution. If anyone can help I'd greatly appreciate it.

Thanks,

-xorl

Replies are listed 'Best First'.
Re: CSV file
by Juerd (Abbot) on Mar 09, 2004 at 16:44 UTC

    My problem is that one of the notes fields contains some unescaped quotation marks, new line characters, and commas.

    Then it is not CSV :)

    If the new line characters and commas are in balanced quotation marks, you might be able to process your file with a regex (how large is your file?). The quotes themselves will be a problem unless you only pick quotes not followed by a comma.

    It'd help if you posted a few lines that are hard to parse together with a description of how you would like to have them.

    Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

      Then it is not CSV :)
      I tried telling them that, but they won't fix how the files are being written. Here is a generalized problem record:
      "data", "moredata", "the notes field "with unescaped quotes" and
      lets have a hard return in here
      in a few places,
      and while we're having fun with that, why not
      have some "random, commas too!", "lots more data", "lastfield in the record"
      
      Does that help any?
        Is every field in quotes, then? If so, you might be able to read lines until you match quote-newline at the end of a line, strip the leading and trailing quotes, then split the whole mess on /",\s*"/. Like so (untested):
        my $line = ''; my @fields; while (<>) { $line .= $_; if (/"\n/ or eof) { $line =~ s/^"//; $line =~ s/"$//; @fields = split(/",\s*/); $line = ''; } }

        The PerlMonk tr/// Advocate
        Do you always know how many fields to expect? If not, it may be impossible to properly interrpret the file ie:
        "data", "moredate", "this is a field", " and another coma for good mea +sure with unescaped quotes followed by a coma"
        was the preceding data 1,2,3, or 4 fields? impossible to tell. Even if you know how many fields there should be it's an impossible problem say there should be 3 fields:
        "data blah","foooo","hmmmmmmmm","what if there arnt even numbers of qu +otes"
        what are the 3 fields? there are several perfectly legal variants. Impossible.


        daN.

        If you always have the same amount of fields, you might get away with:

        my $regex = join ', ', ('".*?"') x 5; $regex = qr/$regex\n/s; while (my @fields = $data =~ /\G$regex/g) { ... }
        (untested as usual)

        Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

        Okay, I'm assuming the following, since the specs are a little unclear:
        • you can separate one record from another
        • the first two fields are consistently quoted
        • the third field always begins and ends with a quote
        • the last two fields are consistently quoted
        • only the third field can contain returns, commas or quotes

        #!/usr/bin/perl use strict; use warnings; my $data = do { local $/ = undef; <DATA> }; my @fields = $data =~ /"(.*?)",\s+"(.*?)",\s+"(.*)",\s+"(.*?)",\s+"(.* +?)"/s; print "RESULTS:\n"; print ":$_:\n" for @fields; __DATA__ "data", "moredata", "the notes field "with unescaped quotes" and lets have a hard return in here in a few places, and while we're having fun with that, why not have some "random, commas too!", "lots more data", "lastfield in the r +ecord"

        The first two and last two captures of the regex are non-greedy, and the middle one sucks everything up. Hope this helps...

Re: CSV file
by Abigail-II (Bishop) on Mar 09, 2004 at 16:44 UTC
    IIRC, MySQL can load data from a CVS file directly just fine.

    Abigail

Re: CSV file
by diotalevi (Canon) on Mar 09, 2004 at 17:34 UTC
      Text::xSV handles this better than Text::CSV and Text::CSV_XS. Don't write code to parse the data - have the module do the work for you.

      Yes Text::xSV is by far better. It has proven more useful than the other soltuions.

      Thanks everyone!

Re: CSV file
by kvale (Monsignor) on Mar 09, 2004 at 16:51 UTC
    In CSV files, quotes are used to protect things like commas that are not delimiters. I'd recommend using a module like Text::CSV_XS to handle your data:
    use Text::CSV_XS; $csv = Text::CSV_XS->new(); # create a new object $csv->types(\@t_array); # Set column types $columns = $csv->getline($io); # Read a line from file $i +o, parse it # and return an array ref +of fields

    -Mark