in reply to CSV file

My problem is that one of the notes fields contains some unescaped quotation marks, new line characters, and commas.

Then it is not CSV :)

If the new line characters and commas are in balanced quotation marks, you might be able to process your file with a regex (how large is your file?). The quotes themselves will be a problem unless you only pick quotes not followed by a comma.

It'd help if you posted a few lines that are hard to parse together with a description of how you would like to have them.

Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

Replies are listed 'Best First'.
Re: Re: CSV file
by xorl (Deacon) on Mar 09, 2004 at 16:51 UTC
    Then it is not CSV :)
    I tried telling them that, but they won't fix how the files are being written. Here is a generalized problem record:
    "data", "moredata", "the notes field "with unescaped quotes" and
    lets have a hard return in here
    in a few places,
    and while we're having fun with that, why not
    have some "random, commas too!", "lots more data", "lastfield in the record"
    
    Does that help any?
      Is every field in quotes, then? If so, you might be able to read lines until you match quote-newline at the end of a line, strip the leading and trailing quotes, then split the whole mess on /",\s*"/. Like so (untested):
      my $line = ''; my @fields; while (<>) { $line .= $_; if (/"\n/ or eof) { $line =~ s/^"//; $line =~ s/"$//; @fields = split(/",\s*/); $line = ''; } }

      The PerlMonk tr/// Advocate
        This looks like it just might work. I will test it and work along these lines. Every field is should be enclosed in quotes and in briefly looking at the files I don't see where splitting on ",\s would be a problem. Thanks!
      Do you always know how many fields to expect? If not, it may be impossible to properly interrpret the file ie:
      "data", "moredate", "this is a field", " and another coma for good mea +sure with unescaped quotes followed by a coma"
      was the preceding data 1,2,3, or 4 fields? impossible to tell. Even if you know how many fields there should be it's an impossible problem say there should be 3 fields:
      "data blah","foooo","hmmmmmmmm","what if there arnt even numbers of qu +otes"
      what are the 3 fields? there are several perfectly legal variants. Impossible.


      daN.

        The example was one record with 5 fields. The 3rd field is this annoying notes field that has all of the problems.

        The first field is supposed to be the tracking number (actually it's alpha-numeric). The other fields are like first and last name, status of the inquiry, name of the judge.

        With the exception of the notes field (i.e. field3) everything should be limited to no more 50 characters and should just have letters except the id field which is alpha-numeric. Of course the notes field contains the other stuff which I described (and which causes all of the problems).

      If you always have the same amount of fields, you might get away with:

      my $regex = join ', ', ('".*?"') x 5; $regex = qr/$regex\n/s; while (my @fields = $data =~ /\G$regex/g) { ... }
      (untested as usual)

      Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

      Okay, I'm assuming the following, since the specs are a little unclear:
      • you can separate one record from another
      • the first two fields are consistently quoted
      • the third field always begins and ends with a quote
      • the last two fields are consistently quoted
      • only the third field can contain returns, commas or quotes

      #!/usr/bin/perl use strict; use warnings; my $data = do { local $/ = undef; <DATA> }; my @fields = $data =~ /"(.*?)",\s+"(.*?)",\s+"(.*)",\s+"(.*?)",\s+"(.* +?)"/s; print "RESULTS:\n"; print ":$_:\n" for @fields; __DATA__ "data", "moredata", "the notes field "with unescaped quotes" and lets have a hard return in here in a few places, and while we're having fun with that, why not have some "random, commas too!", "lots more data", "lastfield in the r +ecord"

      The first two and last two captures of the regex are non-greedy, and the middle one sucks everything up. Hope this helps...