in reply to Re^2: problems parsing CSV

I won't inflict the entire script on you, just the pertinent code snippet. The snippet I posted before now looks like this, with the old code commented out:

while (<RLS>) {
    # $_ = undiacritic($_);
    # s/\(\d{4} and after \"acid aerosols\" only\)//g;
    # $csv = Text::CSV->new();
    $csv = Text::CSV->new ({
        allow_loose_quotes => 1,
        escape_char        => "\\",
        binary             => 1,
    }) or die "" . Text::CSV->error_diag ();
    next if ($. == 1);
    if ($csv->parse($_)) {
        @cols = $csv->fields();
        $tri  = $cols[0];
        $chem = $cols[2];
        $year = $cols[4];
        $lbs  = $cols[5];
        $gms  = $cols[6];
    }
    else {
        $err = $csv->error_input;
        print "Failed to parse line: $err";
    }
}
close(RLS);

allow_loose_quotes did the trick with the embedded quotes. I also found something that said to change the escape_char so it's not the same as the quote_char, so I did that as well. And binary => 1 eliminated the need for undiacritic().
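For reference, here's a minimal, self-contained sketch of that combination; the sample line is invented, but it shows the kind of bare embedded quotes that allow_loose_quotes (paired with an escape_char different from the quote_char) tolerates:

use strict;
use warnings;
use Text::CSV;

# A made-up line with bare quotes inside a quoted field, which the
# default (strict) settings would reject with a parse error.
my $line = qq{TRI001,"emits "acid aerosols" yearly",1999};

my $csv = Text::CSV->new ({
    allow_loose_quotes => 1,
    escape_char        => "\\",    # must differ from quote_char (")
    binary             => 1,
});

$csv->parse ($line) or die "" . $csv->error_diag ();
print join (" | ", $csv->fields ()), "\n";
# TRI001 | emits "acid aerosols" yearly | 1999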

Now, instead of processing each record three times, I'm processing it once. With 1.7 million records, that is very nice.

Well, I thought I was done. It turns out that some of the fields (for lbs and gms) are "", some are 0.0, and some are something like 123.4 in the input file. I changed the assignment to this:

if    (!$cols[5])         { $lbs = 0 }
elsif ($cols[5] eq "0.0") { $lbs = 0 }
else                      { $lbs = $cols[5] }

if    (!$cols[6])         { $gms = 0 }
elsif ($cols[6] eq "0.0") { $gms = 0 }
else                      { $gms = $cols[6] }

and then this test

  if( !$lbs && !$gms )

gives valid results. Is there a better way to do this? It seems rather clunky. I'd have thought that 0.0 would be interpreted the same as 0, but apparently not.

Thanks so much for your help.

Re^4: problems parsing CSV
by Tux (Canon) on Oct 10, 2010 at 10:30 UTC

    Code like

    while (<RLS>) {
        $csv = Text::CSV->new ({
            allow_loose_quotes => 1,
            escape_char        => "\\",
            binary             => 1,
        }) or die "" . Text::CSV->error_diag ();
        next if ($. == 1);
        if ($csv->parse ($_)) {

    will very soon start failing when the fields have embedded newlines, as <RLS> reads physical lines and knows nothing about CSV quoting and escaping. When parsing CSV with Text::CSV or Text::CSV_XS, please use getline (). Also note that in your example you create a new csv parsing object for every line read. You should move the object outside the loop. If possible, choose encoding at the lowest possible level: at open () time.

    my $csv = Text::CSV_XS->new ({
        binary             => 1,
        escape_char        => "\\",
        allow_loose_quotes => 1,
        auto_diag          => 1,
        });
    open my $fh, "<:encoding(utf-8)", "text.csv" or die "text.csv: $!";
    $csv->getline ($fh);    # skip first line
    while (my $row = $csv->getline ($fh)) {
        my ($tri, $chem, $year, $lbs, $gms) =
            map { undiacritic ($_) } @{$row}[0,2,4,5,6];
        :
        :
        }
    close $fh;
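    To illustrate the embedded-newline point, a minimal sketch with an invented record: one CSV record that spans two physical lines, which <RLS> would hand you in broken halves but getline () returns whole.

    use strict;
    use warnings;
    use Text::CSV_XS;

    # One CSV record whose second field contains an embedded newline.
    my $data = qq{1,"acid\naerosols",1999\n};
    open my $fh, "<", \$data or die $!;

    # Reading with <$fh> would yield two incomplete "lines"; getline ()
    # honors the quoting and returns the full record as one row.
    my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
    my $row = $csv->getline ($fh);
    print scalar @{$row}, " fields\n";    # prints "3 fields"
    close $fh;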

    Enjoy, Have FUN! H.Merijn
Re^4: problems parsing CSV
by Jim (Curate) on Oct 10, 2010 at 23:46 UTC

    Once again, I strongly recommend you move the object constructor outside the while loop. Tux, a.k.a. H. Merijn Brand, the author of Text::CSV_XS, which is the base module you're indirectly using, amplified this suggestion in his reply to your post. He explained that "you create a new csv parsing object for every line read." You really don't want to do this needlessly for 1.7 million records. And Tux's advice to use getline instead of <>/parse/fields is a wise one.

    I think you're trying too hard to handle the floating-point weight values. Just be explicit with your tests:

    if ($pounds == 0.0 and $grams == 0.0) { ... }

    The right coercions happen automagically. Here's the proof:

    D:\>perl -E "say '' == 0.0 ? 'Equals' : 'Does not equal'"
    Equals

    D:\>perl -E "say '0' == 0.0 ? 'Equals' : 'Does not equal'"
    Equals

    D:\>perl -E "say '0.0' == 0.0 ? 'Equals' : 'Does not equal'"
    Equals

    D:\>perl -E "say '123' == 0.0 ? 'Equals' : 'Does not equal'"
    Does not equal

    D:\>perl -E "say '123.4' == 0.0 ? 'Equals' : 'Does not equal'"
    Does not equal

    D:\>
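    One caveat: under use warnings, the '' == 0.0 comparison emits an "Argument isn't numeric" warning, which is why it helps to wrap the test in a no warnings 'numeric' block (the follow-up script below uses one). A minimal sketch:

    use strict;
    use warnings;

    my $lbs = '';    # an empty weight field from the CSV
    {
        no warnings 'numeric';    # '' == 0.0 would otherwise warn
        print "weightless\n" if $lbs == 0.0;
    }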

    So you didn't like any of the other improvements I suggested in my refactored version of your script?

    Jim
Re^4: problems parsing CSV
by Jim (Curate) on Oct 11, 2010 at 05:51 UTC

    I've incorporated Tux's suggestion to use getline (getline_hr, actually) instead of <>/parse/fields. It really tightens up the whole script.

    #!/usr/bin/perl

    use strict;
    use warnings;
    use English qw( -no_match_vars );
    use Text::CSV;

    $OUTPUT_FIELD_SEPARATOR  = "\n";
    $OUTPUT_RECORD_SEPARATOR = "\n";

    my $release_file = '../ecodata/releases.txt';

    # Text is in the ISO 8859-1 (Latin 1) encoding
    open my $release_fh, '<:encoding(iso-8859-1)', $release_file
        or die "Can't open release file $release_file: $OS_ERROR\n";

    my $csv = Text::CSV->new({
        auto_diag          => 1,
        binary             => 1,
        allow_loose_quotes => 1,
        escape_char        => '\\',
    });

    # Header is 'TRI,Release#,ChemName,RegNum,Year,Pounds,Grams'
    $csv->column_names($csv->getline($release_fh));

    while (my $value = $csv->getline_hr($release_fh)) {
        {
            no warnings qw( numeric );
            if ($value->{'Pounds'} == 0.0 and $value->{'Grams'} == 0.0) {
                warn "Release $value->{'Release#'} is weightless\n";
            }
        }
        print $value->{'TRI'},
              $value->{'Release#'},
              $value->{'ChemName'},
              $value->{'RegNum'},
              $value->{'Year'},
              $value->{'Pounds'},
              $value->{'Grams'};
    }

    close $release_fh;

    exit 0;

      The bind_columns () method is actually faster. It matters when your streams are big.

      my $csv = Text::CSV->new ({
          auto_diag          => 1,
          binary             => 1,
          allow_loose_quotes => 1,
          escape_char        => "\\",
          });

      # Header is 'TRI,Release#,ChemName,RegNum,Year,Pounds,Grams'
      my %value;
      $csv->bind_columns (\@value{@{$csv->getline ($release_fh)}});

      while ($csv->getline ($release_fh)) {
          {
              no warnings "numeric";
              $value{Pounds} == 0.0 && $value{Grams} == 0.0 and
                  warn "Release $value{'Release#'} is weightless\n";
              }
          print $value{"TRI"},    $value{"Release#"}, $value{"ChemName"},
                $value{"RegNum"}, $value{"Year"},
                $value{"Pounds"}, $value{"Grams"};
          }

      YMMV; benchmark to check that it also holds for your set of data. My speed comparison looks like this. In that image, the lower the line, the faster, so Text::CSV_XS with bind_columns () (labeled "xs bndc") is the fastest at all sizes, and the pure-perl Text::CSV_PP counterpart with bind_columns () (labeled "pp bndc") is the slowest, as it has the most overhead in pure perl. If you only want to see the differences within the XS implementation, look at this graph.
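      If you want to reproduce such a comparison on your own file, here is a rough sketch using the core Benchmark module; the file name and the seven-column layout are assumptions taken from this thread, so adjust to taste:

      use strict;
      use warnings;
      use Benchmark qw( timethese );
      use Text::CSV_XS;

      my $file = "releases.txt";    # hypothetical test file

      timethese (3, {
          getline => sub {
              my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
              open my $fh, "<", $file or die "$file: $!";
              1 while $csv->getline ($fh);    # a fresh row ref each read
              close $fh;
              },
          bind_columns => sub {
              my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
              open my $fh, "<", $file or die "$file: $!";
              my @f;
              $csv->bind_columns (\(@f[0 .. 6]));    # 7 columns assumed
              1 while $csv->getline ($fh);    # fields land in @f directly
              close $fh;
              },
          });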

      Update 1: removed the erroneous call to column_names () as spotted by Jim.

      Update 2: New graphs: XS + PP and XS only


      Enjoy, Have FUN! H.Merijn

        Ok, here's the same script using bind_columns.

        #!/usr/bin/perl

        use strict;
        use warnings;
        use English qw( -no_match_vars );
        use Text::CSV;

        $OUTPUT_FIELD_SEPARATOR  = "\n";
        $OUTPUT_RECORD_SEPARATOR = "\n";

        my $release_file = '../ecodata/releases.txt';

        open my $release_fh, '<', $release_file
            or die "Can't open release file $release_file: $OS_ERROR\n";

        my $csv = Text::CSV->new({
            auto_diag          => 1,
            binary             => 1,
            allow_loose_quotes => 1,
            escape_char        => '\\',
        });

        my %value;

        # Header is 'TRI,Release#,ChemName,RegNum,Year,Pounds,Grams'
        my @column_labels = $csv->column_names($csv->getline($release_fh));
        $csv->bind_columns(\@value{@column_labels});

        while ($csv->getline_hr($release_fh)) {
            {
                no warnings 'numeric';
                if ($value{'Pounds'} == 0.0 and $value{'Grams'} == 0.0) {
                    warn "Release $value{'Release#'} is weightless\n";
                }
            }
            print $value{'TRI'},
                  $value{'Release#'},
                  $value{'ChemName'},
                  $value{'RegNum'},
                  $value{'Year'},
                  $value{'Pounds'},
                  $value{'Grams'};
        }

        close $release_fh;

        exit 0;

        I had to change your...

        \@value{@{$csv->column_names($csv->getline($release_fh))}}
        ...to...
        \@value{$csv->column_names($csv->getline($release_fh))}
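        The reason, as I understand it, is that column_names returns a plain list of the names rather than an array reference, so the extra @{ } dereference was wrong. A quick check with invented column names:

        use strict;
        use warnings;
        use Text::CSV;

        my $csv = Text::CSV->new({ binary => 1 });

        # column_names sets the names and returns them as a list
        my @names = $csv->column_names(qw( TRI Year Pounds ));
        print scalar @names, "\n";    # prints 3; a list, not a ref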