Thanks for all the suggestions. Some of them are over my head, so I will need to meditate on them.

I'm afraid I wasn't clear. The example I gave was not real data; it was meant to show what the records that aren't being processed correctly look like.

The input file has over a million records, and new data will be added from time to time. The problem records are not all identical, but they are formatted like my previous example. They are similar enough that I was able to strip the offending text with a substitution, since I'm not even using that portion of the data. But it is ugly and slow.

I've simplified the code to show the parsing attempt.

#!/usr/bin/perl -w
use strict;
use Text::Undiacritic qw(undiacritic);
use Text::CSV;

my ( $tri, $chem, $year, $lbs, $gms, $rls, $csv, $err, @cols );

# open the release report file for input
#$rls = "rls.tst";
$rls = "../ecodata/releases.txt";
open( RLS, $rls ) || die "bad open $rls";

# RLS: TRI, Release#, ChemName, RegNum, Year, Pounds, Grams
while( <RLS> ) {
    $_ = undiacritic($_);
    s/\(\d{4} and after \"acid aerosols\" only\)//g;
    $csv = Text::CSV->new();
    next if ($. == 1);
    if ($csv->parse($_)) {
        @cols = $csv->fields();
        $tri  = $cols[0];
        $chem = $cols[2];
        $year = $cols[4];
        $lbs  = $cols[5];
        $gms  = $cols[6];
    } else {
        $err = $csv->error_input;
        print "Failed to parse line: $err";
    }
}
close(RLS);

Here is a tiny bit of the output before I put in the substitution - I'm sure there is a better way to do this:

Failed to parse line: 00617BRSTLSTATE,"1394080382029","Sulfuric acid (1994 and after "acid aerosols" only)",7664-93-9,1994,500.0,""
Failed to parse line: 00617BRSTLSTATE,"1394080382031","Hydrochloric acid (1995 and after "acid aerosols" only)",7647-01-0,1994,2842.0,""
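Both rejected lines have the same shape: the chemical name is a quoted field that itself contains quote characters that are not doubled. As a rough sketch only, one alternative to deleting the parenthetical text would be to double those inner quotes before parsing. The helper name escape_inner_quotes is made up for illustration. It assumes that a quote which opens or closes a field always touches a comma or the start or end of the line, and that an embedded quote never does. Data that breaks that assumption (for example, an embedded quote directly followed by a comma, or quotes that are already doubled) would be mangled.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script.
# Doubles every quote that has a non-comma character on both sides,
# i.e. a quote that cannot be opening or closing a field under the
# assumption stated above. Quotes next to a comma, or at the very
# start or end of the line, are left alone.
sub escape_inner_quotes {
    my ($line) = @_;
    $line =~ s/(?<=[^,])"(?=[^,\r\n])/""/g;
    return $line;
}

# One of the lines that failed to parse.
my $bad = '00617BRSTLSTATE,"1394080382029","Sulfuric acid (1994 and after "acid aerosols" only)",7664-93-9,1994,500.0,""';

print escape_inner_quotes($bad), "\n";
```

In the loop above, the call would go where the substitution is now, before the line is handed to parse(). Doubled quotes inside a quoted field are the form Text::CSV accepts with its default settings.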

I hope this is clearer. Thanks so much for your help; now I must go meditate over what you've suggested.


In reply to Re^2: problems parsing CSV by helenwoodson
in thread problems parsing CSV by helenwoodson
