in reply to Re: DBD::CSV and embedded CR-LFs
in thread Finding CR-LFs within quoted CSV fields

Ah HA! (*smack* -> head) I've figured out the problem, and it has nothing to do with DBD::CSV.

The CSV file that we're processing is exported in an unusual manner. Embedded quotes aren't escaped in any way. We have a routine that pre-processes the CSV to find those embedded quotes and escapes them (such that a " becomes a "" pair). The processed CSV file is then handed off to DBD::CSV for normal processing.

Recently, the user added a field to the CSV file, which promptly broke our program. In the process of trying to figure out the problem (because, of course, they didn't tell us they had added a field, only that the program had stopped working), we discovered that CR-LFs were embedded in some fields. We then leaped to the (incorrect) conclusion that this was a recent occurrence and the cause of our problems.

We later learned that a field had been added. What we didn't know at the time was that the new field also triggered a bug in our pre-processing code.

In trying to fix the supposed problem with the embedded CR-LFs, we had replaced the buggy pre-processing code. So, when we ran the program with the new (correct) code, which included the CR-LF replacement, it worked. When we reverted to the (buggy) old pre-processing code, it stopped working. That led us to the incorrect conclusion that the CR-LF replacement code was needed.

So, in summary, DBD::CSV handles embedded CR-LFs fine. Mystery solved! (Of course, if there's a nifty setting to handle unescaped quotes, I'd be glad to learn of it!)

Wally Hartshorn

Re: DBD::CSV handles embedded CR-LFs fine!
by jZed (Prior) on Jun 14, 2004 at 23:14 UTC
    I've figured out the problem,
    I'm glad for you! :-)
    and it has nothing to do with DBD::CSV.
    I'm glad for me! :-)
    Of course, if there's a nifty setting to handle unescaped quotes, I'd be glad to learn of it!
    Are the fields already quote delimited? In other words do they resemble number 1 or number 2?
       1. "foo","bad " bad","bar"
       2. foo,bad " bad,bar   
    
    If you have records of type #1, I don't have much to suggest beyond the pre-processing you are already doing. If your records are of type #2, however, you ought to be able to handle that all within DBD::CSV. Set csv_quote_char to undef. The stray quote will then no longer be treated as a field delimiter; it will just be an ordinary character, because the quote character is undef rather than double-quote. If that won't work, show us some data and maybe someone will have a suggestion.
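
A rough, untested sketch of the setting suggested for type #2 records. The `%attr` name and the `f_dir` value are placeholders chosen for illustration, and the connect call is left commented out because it needs DBI and DBD::CSV installed:

```perl
use strict;
use warnings;

# Attributes intended as the fourth argument to DBI->connect.
# Assumption: the CSV files live in the current directory.
my %attr = (
    f_dir          => '.',      # placeholder directory holding the CSV files
    csv_quote_char => undef,    # no quote character: a " is just data
    RaiseError     => 1,        # die on errors instead of returning undef
);

# Requires DBI and DBD::CSV:
# my $dbh = DBI->connect('dbi:CSV:', undef, undef, \%attr);
```

If stray quotes still cause parse errors, csv_escape_char may need adjusting as well, since it also defaults to the double quote.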

    Thanks for following up to let us know how it turned out.

      The data is like type #1, unfortunately. It would be something like this:

      "Smith","John",12/31/1962,"Author of "How to Break Programs" and other books","Bugger"

      I'm using a series of regexes to change that to:

      "Smith","John",12/31/1962,"Author of ""How to Break Programs"" and other books","Bugger"

      It would be nice if the user just gave us a valid CSV file to begin with, but....
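
A minimal sketch of that kind of quote-doubling, not the actual regexes used here. It assumes each record is available as one complete string with the trailing line ending removed, and it uses a simple heuristic: a quote is treated as embedded only when it has a non-comma character on both sides. The helper name `escape_embedded_quotes` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: double every quote that has a non-comma character
# immediately before AND after it. Quotes at the start or end of the record,
# or next to a comma, are assumed to be real field delimiters and are kept.
sub escape_embedded_quotes {
    my ($record) = @_;
    $record =~ s/(?<=[^,])"(?=[^,])/""/g;
    return $record;
}

my $raw   = '"Smith","John",12/31/1962,"Author of "How to Break Programs" and other books","Bugger"';
my $fixed = escape_embedded_quotes($raw);
print "$fixed\n";
```

The heuristic cannot tell an embedded quote that sits directly next to a comma (as in `"say "hi", then leave"`) from a real field boundary, so data like that would still need special handling.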

      Thanks, and sorry for the confusion!

      Wally Hartshorn