I won't inflict the entire script on you, just the pertinent code snippet. The snippet I posted before now looks like this, with the old code commented out:

# Build the parser once, before the loop, rather than once per record.
$csv = Text::CSV->new (
    { allow_loose_quotes     => 1 ,
      escape_char            => "\\",
      binary                 => 1,
    }
) or die "Cannot use Text::CSV: " . Text::CSV->error_diag ();

while ( <RLS> )
{
#        $_ = undiacritic($_);
#        s/\(\d{4} and after \"acid aerosols\" only\)//g;
#        $csv = Text::CSV->new();
        next if ($. == 1);      # skip the header line
        if ($csv->parse($_)) {
                @cols = $csv->fields();
                $tri    = $cols[0];
                $chem   = $cols[2];
                $year   = $cols[4];
                $lbs    = $cols[5];
                $gms    = $cols[6];
        }
        else {
                $err = $csv->error_input;
                print "Failed to parse line: $err";
        }
}
close(RLS);

allow_loose_quotes did the trick on the embedded quotes. I also found something that said to change the escape_char so it's not the same as the quote_char, so I did that as well. binary => 1 eliminated the need for undiacritic().

Now, instead of processing each record three times, I'm processing it once. With 1.7 million records, that is very nice.

Well, I thought I was done. It turns out that some of the fields (for lbs and gms) are "", some are 0.0, and some are something like 123.4 in the input file. I changed the assignment to this:

if (!$cols[5]) { $lbs = 0 } elsif ($cols[5] eq "0.0") { $lbs = 0 } else { $lbs = $cols[5] }
if (!$cols[6]) { $gms = 0 } elsif ($cols[6] eq "0.0") { $gms = 0 } else { $gms = $cols[6] }

and then this test

if( !$lbs && !$gms )

gives valid results. Is there a better way to do this? It seems rather clunky. I'd have thought that 0.0 would be interpreted the same as 0, but apparently not.
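
One possible simplification, offered as a sketch rather than a definitive fix: the parsed fields come back as strings, and in Perl the string "0.0" is true in boolean context (only undef, "", "0" and numeric 0 are false), while the number 0.0 is false. Forcing numeric context by adding 0 makes "0.0" behave like 0. The helper name to_number and the sample row below are made up for illustration and are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): coerce a CSV field
# to a number, treating undef and the empty string as 0.
sub to_number {
    my ($field) = @_;
    return 0 unless defined $field && length $field;
    return $field + 0;    # numeric context: the string "0.0" becomes 0
}

# Sample row standing in for $csv->fields(); the values are invented.
my @cols = ('id', 'x', 'chem', 'y', '2010', '0.0', '');

my $lbs = to_number($cols[5]);
my $gms = to_number($cols[6]);

print "both zero\n" if !$lbs && !$gms;
```

With this, the two assignments collapse to one call each and the !$lbs && !$gms test works unchanged. A field that is neither empty nor numeric (say, "n/a") would still trigger a "isn't numeric" warning under use warnings, so this assumes the columns hold only "", 0.0-style zeros, or real numbers.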

Thanks so much for your help.

In reply to Re^3: problems parsing CSV by helenwoodson
in thread problems parsing CSV by helenwoodson
