in reply to Re^2: problems parsing CSV
in thread problems parsing CSV
I won't inflict the entire script on you, just the pertinent code snippet. The snippet I posted before now looks like this, with the old code commented out:
while <RLS> ) { # $_ = undiacritic($_); # s/\(\d{4} and after \"acid aerosols\" only\)//g; # $csv = Text::CSV->new(); $csv = Text::CSV->new ( { allow_loose_quotes => 1 , escape_char => "\\", binary => 1, } ) or die "" . Text::CSV->error_diag (); next if ($. == 1); if ($csv->parse($_)) { @cols = $csv->fields(); $tri = $cols[0]; $chem = $cols[2]; $year = $cols[4]; $lbs = $cols[5]; $gms = $cols[6]; } else { $err = $csv->error_input; print "Failed to parse line: $err"; } } close(RLS);
allow_loose_quotes did the trick on the embedded quotes - I found something that said to change the escape_char so it's not the same as the quote_char, so I did that as well. binary=>1 eliminated the need for undiacritic().
Now, instead of processing each record three times, I'm processing it once. With 1.7 million records, that is very nice.
Well, I thought I was done. It turns out that some of the fields (for lbs and gms) are "", some are 0.0, and some are something like 123.4 in the input file. I changed the assignment to this:
if (!$cols[5]) { $lbs = 0 } elsif ($cols[5] eq "0.0") { $lbs = 0 } els +e { $lbs = $cols[5] } if (!$cols[6]) {$gms = 0} elsif ($cols[5] eq "0.0") { $lbs = 0 } else +{ $gms = $cols[6] }
and then this test
if( !$lbs && !$gms )gives valid results. Is there a better way to do this? It seems rather clunky. I'd have thought that 0.0 would be interpreted the same as 0, but apparently not.
Thanks so much for your help.
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re^4: problems parsing CSV
by Tux (Canon) on Oct 10, 2010 at 10:30 UTC | |
|
Re^4: problems parsing CSV
by Jim (Curate) on Oct 10, 2010 at 23:46 UTC | |
|
Re^4: problems parsing CSV
by Jim (Curate) on Oct 11, 2010 at 05:51 UTC | |
by Tux (Canon) on Oct 11, 2010 at 06:40 UTC | |
by Jim (Curate) on Oct 12, 2010 at 20:51 UTC | |
by Tux (Canon) on Oct 13, 2010 at 06:20 UTC | |
by Jim (Curate) on Oct 13, 2010 at 16:36 UTC |