I won't inflict the entire script on you, just the pertinent code snippet. The snippet I posted before now looks like this, with the old code commented out:
while <RLS> ) { # $_ = undiacritic($_); # s/\(\d{4} and after \"acid aerosols\" only\)//g; # $csv = Text::CSV->new(); $csv = Text::CSV->new ( { allow_loose_quotes => 1 , escape_char => "\\", binary => 1, } ) or die "" . Text::CSV->error_diag (); next if ($. == 1); if ($csv->parse($_)) { @cols = $csv->fields(); $tri = $cols[0]; $chem = $cols[2]; $year = $cols[4]; $lbs = $cols[5]; $gms = $cols[6]; } else { $err = $csv->error_input; print "Failed to parse line: $err"; } } close(RLS);
allow_loose_quotes did the trick on the embedded quotes - I found something that said to change the escape_char so it's not the same as the quote_char, so I did that as well. binary=>1 eliminated the need for undiacritic().
Now, instead of processing each record three times, I'm processing it once. With 1.7 million records, that is very nice.
Well, I thought I was done. It turns out that some of the fields (for lbs and gms) are "", some are 0.0, and some are something like 123.4 in the input file. I changed the assignment to this:
if (!$cols[5]) { $lbs = 0 } elsif ($cols[5] eq "0.0") { $lbs = 0 } els +e { $lbs = $cols[5] } if (!$cols[6]) {$gms = 0} elsif ($cols[5] eq "0.0") { $lbs = 0 } else +{ $gms = $cols[6] }
and then this test
if( !$lbs && !$gms )gives valid results. Is there a better way to do this? It seems rather clunky. I'd have thought that 0.0 would be interpreted the same as 0, but apparently not.
Thanks so much for your help.
In reply to Re^3: problems parsing CSV
by helenwoodson
in thread problems parsing CSV
by helenwoodson
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |