Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Finding CR-LFs within quoted CSV fields

by Wally Hartshorn (Hermit)
on Jun 11, 2004 at 16:07 UTC ( [id://363456]=perlquestion: print w/replies, xml ) Need Help??

Wally Hartshorn has asked for the wisdom of the Perl Monks concerning the following question:

My apologies if this is an FAQ, but I've been unable to find an answer.

I have a comma-separated values (CSV) file exported from a FoxPro database. The text fields within the CSV file are enclosed within double-quotes.

Unfortunately, some of the fields contain embedded CR-LF characters. The DBD::CSV module interprets those CR-LF characters as end-of-record markers. I haven't found any way to tell DBD::CSV that a CR-LF pair within a quoted field is not an end-of-record marker, so I'm using a regex to convert the CR-LF pairs into HTML <br> tags. (The text will be displayed in a browser, so that's what I want anyway.)

The code I've developed to do this is ugly and not bulletproof. It assumes that a quote, followed by a CR-LF, followed by a quote should be treated as the closing quote of a the last field of one record, followed by an end-of-record, followed by the opening quote of the first field of the next record. This is not always true.

{ undef $/; $slurp = <$fh>; # slurp up the file # append a bogus final quote to the end of the file $slurp .= '"'; # replace the end-of-record CR-LFs $slurp =~ s/"\r\n"/"__EOR__"/g; # replace the other CR-LFs with <br> tags $slurp =~ s/\r\n/<br>/g; # restore the end-of-record CR-LFs $result =~ s/"__EOR__"/"\r\n"/g; # remove the bogus final quote $result =~ s/"$//; }

The correct way to do this would be to loop through the file, counting opening quotes and closing quotes, replacing any CR-LFs within an opening quote/closing quote pair with HTML <br> tags.

Doing this as a while() loop seems awkward. Is there some elegant regex that would handle this?

Wally Hartshorn

Replies are listed 'Best First'.
Re: Finding CR-LFs within quoted CSV fields
by particle (Vicar) on Jun 11, 2004 at 16:12 UTC

    try tilly's Text::xSV insead. it handles CSV files with embedded newlines.

    ~Particle *accelerates*

Re: Finding CR-LFs within quoted CSV fields
by jZed (Prior) on Jun 11, 2004 at 17:09 UTC
    You are wrong about DBD::CSV. It fully supports newlines embedded in quoted fields. It does this by default, no special actions are required on your part. (I am its maintainer, I should know). If you weren't able to get it to work, I suspect that your data is using different line endings than your code. Please show your data and your code.

      Hmm... That's odd. This is an existing Perl program that is using DBD::CSV and was working fine for months until last week. The user brought up the web-based upload form, uploaded their CSV file (exported from FoxPro), and got 0 records imported. We looked at the file they were uploading and discovered there were a few records that had embedded CR-LFs. Adding a regex to pre-process the uploaded file and remove the embedded CR-LFs solved the problem and allowed DBD::CSV to process the file.

      If DBD::CSV can already handle that on its own, then I'm mystified as to why our regex would solve the problem. It looks like we'll need to investigate further. We'll double-check things on Monday to make sure we got the results we thought and will get back you. Thanks!

      Wally Hartshorn

        If DBD::CSV can already handle that on its own, then I'm mystified as to why our regex would solve the problem
        Me too. Since diotalevi seems to have had the same problem, it's possible there's a bug somewhere. I'd really appreciate help in tracking it down. Could you and diotalevi both please let me know which versions of SQL::Statement and DBD::File you're using, it's more likely to be in them.
      That is odd. I stopped using this module for exactly this reason as well. I will try to get back to you about what your module was dying on.
        Thanks, I appreciate the help. It's most likely in SQL::S or DBD::F rather than DBD::CSV itself.

        update diotalevi msg'd me that it was a different module he had been thinking of, that he actually doesn't use DBD::CSV.

Re: Finding CR-LFs within quoted CSV fields
by Your Mother (Archbishop) on Jun 12, 2004 at 22:37 UTC

    I'm not saying this is the answer, or even safe depending on your data, but I had a similar situation solved by it. I was using a DB_File to YAML to DB_File script that was running perfectly for months until is just broke one day. Running this on the plain text data fixed it.

    $text =~ s/[^[:print:]]//g;

    (Probably had inserted a bunch of invisible garbage from users cutting and pasting out of different apps.)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://363456]
Approved by particle
Front-paged by exussum0
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others sharing their wisdom with the Monastery: (9)
As of 2024-03-28 12:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found