in reply to Auto correct a csv file

It took me a while to see what your problem is. Some fields in your original CSV file contain embedded line breaks, but there's no quoting or escaping, so a "normal" CSV parse won't work very well.

When a line ends with a comma, you can join it with the following line by removing the line break(s) after the comma. (But your regex does it wrong: it won't apply at all on LF-style data, it won't handle extra blank lines properly for CRLF-style data, and it removes the comma, which should be kept.)

When a line begins with a comma, you want to join it to the previous line, but the previous line has already been processed and written to output, so it's too late to fix that.

So, don't do it one line at a time - process the whole file as a single string:

perl -e '$/=undef; $_=<>; s/,[\r\n]+/,/g; s/[\r\n]+,/,/g; print' infile > infile.fxd
Those regexes preserve the commas, and handle any number of consecutive line breaks before or after a comma (for both LF and CRLF data).
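
If you'd rather see the two substitutions at work on a small string before running them on real data, here is a minimal sketch. The sub name join_broken_records and the sample data are made up for illustration; the regexes are the same two as in the one-liner.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the same two substitutions as the one-liner
# to a string that holds the whole file.
sub join_broken_records {
    my ($text) = @_;
    $text =~ s/,[\r\n]+/,/g;   # break(s) after a comma: keep the comma, drop the breaks
    $text =~ s/[\r\n]+,/,/g;   # break(s) before a comma: keep the comma, drop the breaks
    return $text;
}

# Made-up sample: one record split after a comma (CRLF plus a blank line)
# and split again before a comma (LF), followed by an intact record.
my $sample = "id,name,\r\n\r\nnote\n,city\nnext,record,here\n";
print join_broken_records($sample);
# prints:
# id,name,note,city
# next,record,here
```

One thing to keep in mind: a record that legitimately starts or ends with an empty field (a leading or trailing comma) will also get joined to its neighbor, so this only suits data where that doesn't happen.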

(Note that I'm redirecting output to a different file, rather than replacing the original - that makes it easy to "try, try again" for cases like this, where you seldom get it right the first time. Once you get it right, then you can rename the output to replace the input.)

(Update: P.S.: Welcome to the Monastery!)

(Updated again to add remarks about LF vs. CRLF data)

Re^2: Auto correct a csv file
by karthikAk (Initiate) on Feb 19, 2014 at 12:51 UTC

    This works perfectly. But it deletes the very first line. How to prevent this?

      … But it deletes the very first line. How to prevent this?

      Can you post an example data file to demonstrate this problem? (For example, just copy the first few lines from the "real" data file in question and put them in a separate file. Then run the command line to produce a modified version of that.) If the first line of input is missing from the output, please post the input, the command line you actually used, and the output.

      I suspect that one (or more) of the following could be happening:

      • You aren't using the exact code that I posted.
      • What you think should be "the first line" of your input file is not actually in that file to begin with (that is, it was missing before you ran the script).
      • There's something goofy going on with carriage-return characters in your data, and the first line really is there, but you might not be "seeing" it because maybe it ends with just CR instead of CRLF, which might cause the 2nd line to be printed "on top of" the first one in your display.

      I have no way to answer your question without knowing more about your data.
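
To check the carriage-return possibility mentioned above, something like the following could help. It is only a rough sketch: the sub name count_line_endings and the sample string are invented for illustration, and it simply counts how many CRLF, bare CR, and bare LF line endings a string contains.

```perl
use strict;
use warnings;

# Hypothetical helper: count each line-ending style found in a string.
sub count_line_endings {
    my ($text) = @_;
    my %count = (CRLF => 0, CR => 0, LF => 0);
    # CRLF is listed first so a CR followed by LF is counted once, as CRLF.
    while ($text =~ /(\r\n|\r|\n)/g) {
        my $key = $1 eq "\r\n" ? 'CRLF' : $1 eq "\r" ? 'CR' : 'LF';
        $count{$key}++;
    }
    return %count;
}

# Made-up sample: the header line ends in a bare CR, the rest in CRLF.
my %seen = count_line_endings("h1,h2\rrow1,a\r\nrow2,b\r\n");
printf "%s=%d\n", $_, $seen{$_} for qw(CRLF CR LF);
# prints:
# CRLF=2
# CR=1
# LF=0
```

For a real file, read the whole thing into a string first, opening it with the ':raw' layer so that Perl doesn't translate line endings on the way in. A nonzero CR count next to a mostly-CRLF file would point to the "printed on top of the first line" explanation.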