in reply to Re: problems parsing CSV
in thread problems parsing CSV
Thanks for all the suggestions. Some of them are over my head, so I will need to meditate on them.
I'm afraid I wasn't clear. The example I gave was not real data, but an illustration of what the lines that fail to parse look like.
The input file has over a million records, and new data will be added from time to time. The problem lines are not all identical, but they are formatted like my previous example. They are similar enough that I was able to remove the offending text with a substitution, since I'm not even using that portion of the data. But it is ugly and slow.
I've simplified the code to show the parsing attempt.
    #!/usr/bin/perl -w
    use strict;
    use Text::Undiacritic qw(undiacritic);
    use Text::CSV;

    my ( $tri, $chem, $year, $lbs, $gms, $rls, $csv, $err, @cols );

    # open the release report file for input
    #$rls = "rls.tst";
    $rls = "../ecodata/releases.txt";
    open( RLS, $rls ) || die "bad open $rls";

    # RLS: TRI, Release#, ChemName, RegNum, Year, Pounds, Grams
    while( <RLS> ) {
        $_ = undiacritic($_);
        s/\(\d{4} and after \"acid aerosols\" only\)//g;
        $csv = Text::CSV->new();
        next if ($. == 1);
        if ($csv->parse($_)) {
            @cols = $csv->fields();
            $tri  = $cols[0];
            $chem = $cols[2];
            $year = $cols[4];
            $lbs  = $cols[5];
            $gms  = $cols[6];
        } else {
            $err = $csv->error_input;
            print "Failed to parse line: $err";
        }
    }
    close(RLS);
Here is a tiny bit of the output from before I put in the substitution. I'm sure there is a better way to do this:
    Failed to parse line: 00617BRSTLSTATE,"1394080382029","Sulfuric acid (1994 and after "acid aerosols" only)",7664-93-9,1994,500.0,""
    Failed to parse line: 00617BRSTLSTATE,"1394080382031","Hydrochloric acid (1995 and after "acid aerosols" only)",7647-01-0,1994,2842.0,""
I hope this is more clear. Thanks so much for your help; now I must go meditate over what you've suggested.
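For reference, here is a minimal sketch of one possible way to handle lines like the two above without the substitution. The lines fail because the quoted chemical-name field contains bare, unescaped double quotes. The sketch assumes the installed Text::CSV supports the documented `allow_loose_quotes` attribute, which is meant to be combined with an `escape_char` that differs from the quote character. The helper name `parse_loose` and the hard-coded sample line are illustrative only, and this has not been tried against the full releases file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

# Build the parser once, outside the read loop, rather than once per line.
# allow_loose_quotes tolerates bare quotes inside a quoted field, provided
# escape_char is set to something other than the quote character.
my $csv = Text::CSV->new({
    binary             => 1,
    allow_loose_quotes => 1,
    escape_char        => "\\",
}) or die "Cannot create Text::CSV parser: " . Text::CSV->error_diag();

# parse_loose is a hypothetical helper name, not part of Text::CSV.
# Returns an array reference of fields, or undef if the line does not parse.
sub parse_loose {
    my ($line) = @_;
    chomp $line;
    return $csv->parse($line) ? [ $csv->fields() ] : undef;
}

# Sample line copied from the failing output above.
my $sample = '00617BRSTLSTATE,"1394080382029","Sulfuric acid (1994 and after "acid aerosols" only)",7664-93-9,1994,500.0,""';

if ( my $cols = parse_loose($sample) ) {
    my ( $tri, $chem, $year, $lbs ) = @{$cols}[ 0, 2, 4, 5 ];
    print "$tri | $chem | $year | $lbs\n";
}
else {
    print "Failed to parse line: ", $csv->error_input(), "\n";
}
```

Creating the parser a single time also avoids constructing a new Text::CSV object for each of the million-plus records, which may help with the slowness. Whether the loose-quote setting is safe for every record in the file is something to verify on real data.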