iburrell has asked for the wisdom of the Perl Monks concerning the following question:

I hate comma-separated files because there is no standard format and many programs produce subtly broken files. The files I am having trouble reading don't escape quotes inside fields by doubling them to "".

I need a quick and dirty regular expression that can find all quotes not preceded or followed by commas and turn them into two quotes.

Re: Repairing bad CSV
by demerphq (Chancellor) on Jun 17, 2003 at 20:18 UTC

    I'm extremely doubtful that you understand what you want properly. The first example meets your requirement (find all quotes not preceded or followed by commas and turn them into two quotes), the second is a modified form using character classes that doesn't help much, and the third is the 'and' form (find all quotes neither preceded nor followed by commas and turn them into two quotes). None of them make much sense to me. *shrugs*

    #!perl -l
    for ('"Boo","Baz","Bar"','",",","') {
        {
            local $_=$_;
            s/(?<!,)"|"(?!,)/""/g;
            print "Or : $_\n";
        }
        {
            local $_=$_;
            s/([^,])"|"(([^,]))/$1 ? $1.'""' : '""'.$2/ge;
            print "Or (class): $_\n";
        }
        {
            local $_=$_;
            s/([^,])"([^,])/$1""$2/g;
            print "And: $_\n";
        }
    }
    __END__
    Or : ""Boo"",""Baz"",""Bar""
    Or (class): ""Boo"",""Baz"",""Bar""
    And: "Boo","Baz","Bar"
    Or : "",",",""
    Or (class): ",",","
    And: ",",","

    ---
    demerphq

    <Elian> And I do take a kind of perverse pleasure in having an OO assembly language...

    • Update:  
    On rereading this I realize that it came off sounding the wrong way, and that I had probably interpreted your requirements overly literally. My apologies for any offense caused. :-)


Re: Repairing bad CSV
by Enlil (Parson) on Jun 17, 2003 at 20:32 UTC
    Though this does not do exactly what you asked for, I think it will do what you meant. That is, it changes any quote that is preceded by something that is not a comma (thus skipping the first quote in the string) and that is followed by something other than a comma or the end of the line (or the newline just before the end).
    use strict;
    use warnings;
    while ( <DATA> ) {
        s/([^,])"(?!,|$)/$1""/g;
        print $_;
    }
    __DATA__
    "A","B"",""C","D","E"
    ""F", ,"G""
    "HI"J"K"L","M"
    1,2,3,"t"l"r",4,5,6
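
For reference, here is a minimal sketch of what that substitution does to the last two sample lines. The sub name repair_line is made up for this illustration and is not part of the post; the inputs are taken from the sample data, without trailing newlines.

```perl
use strict;
use warnings;

# repair_line is a hypothetical helper name, not from the thread.
sub repair_line {
    my ($line) = @_;
    # Double a quote that follows a non-comma character and is not
    # followed by a comma or by the end of the line.
    $line =~ s/([^,])"(?!,|$)/$1""/g;
    return $line;
}

print repair_line('"HI"J"K"L","M"'), "\n";       # "HI""J""K""L","M"
print repair_line('1,2,3,"t"l"r",4,5,6'), "\n";  # 1,2,3,"t""l""r",4,5,6
```

Only the quotes inside a field are doubled; the quotes that open and close each field are left alone.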

    -enlil

Re: Repairing bad CSV
by monsieur_champs (Curate) on Jun 17, 2003 at 20:07 UTC

    Hello, fellow.

    You shall use this:

    s/([^,])\"([^,])/$1\"\"$2/g;

    Unfortunately, I have no data to test this with. Please let me know if I made a mistake. I will gladly correct the problem for you.
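
One case that may be worth testing is a field where stray quotes are separated by a single character. Because the pattern consumes the character after each quote, that character cannot also serve as the character before the next quote, so every second quote can be skipped. The sketch below compares it with a lookaround form that consumes nothing around the quote. Both sub names are made up for this illustration, and the sample line is borrowed from elsewhere in the thread.

```perl
use strict;
use warnings;

# Both sub names are hypothetical, chosen for this illustration.
sub double_consuming {
    my ($line) = @_;
    # Captures the characters on both sides of the quote, so the
    # character after one quote is used up before the next match starts.
    $line =~ s/([^,])"([^,])/$1""$2/g;
    return $line;
}

sub double_lookaround {
    my ($line) = @_;
    # Lookbehind and lookahead only assert their context; the match
    # itself is just the quote.
    $line =~ s/(?<=[^,])"(?=[^,\n])/""/g;
    return $line;
}

my $sample = '"HI"J"K"L","M"';
print double_consuming($sample),  "\n";   # "HI""J"K""L","M"
print double_lookaround($sample), "\n";   # "HI""J""K""L","M"
```

In the first result the quote between J and K is still single; the lookaround form doubles all three embedded quotes.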

    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    Just Another Perl Monk

      That mostly works, but it will also match the \n at the end of the line. I also usually use non-capturing groups. This is what I ended up using:
      s/(?:[^,])"(?:[^,\n])/""/g;
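
A point to check before running a non-capturing variant on real data: a (?:...) group still consumes the characters it matches, so unless they reappear in the replacement they are lost. The following is a minimal sketch on a made-up sample line, with a lookaround form shown for comparison; the variable names are assumptions for illustration only.

```perl
use strict;
use warnings;

my $sample = '1,"t"l"r",4';

# Non-capturing groups consume the "t" and the "l" around the first
# embedded quote, and the replacement does not put them back.
(my $dropped = $sample) =~ s/(?:[^,])"(?:[^,\n])/""/g;

# Lookarounds assert the same context without consuming it.
(my $kept = $sample) =~ s/(?<=[^,])"(?=[^,\n])/""/g;

print "$dropped\n";   # 1,""""r",4
print "$kept\n";      # 1,"t""l""r",4
```

The lookaround form keeps the field text intact and also doubles the second embedded quote.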
Re: Repairing bad CSV
by clscott (Friar) on Jun 18, 2003 at 12:45 UTC

    Have you tried Text::CSV_XS or Text::CSV?

    Text::CSV_XS works for me all of the time, especially when the data comes from MS Excel.

    --
    Clayton aka "Tex"