in reply to Re^2: problems parsing CSV

I won't inflict the entire script on you, just the pertinent code snippet. The snippet I posted before now looks like this, with the old code commented out:

while (<RLS>) {
    # $_ = undiacritic($_);
    # s/\(\d{4} and after \"acid aerosols\" only\)//g;
    # $csv = Text::CSV->new();
    $csv = Text::CSV->new ({
        allow_loose_quotes => 1,
        escape_char        => "\\",
        binary             => 1,
    }) or die "" . Text::CSV->error_diag ();
    next if ($. == 1);
    if ($csv->parse($_)) {
        @cols = $csv->fields();
        $tri  = $cols[0];
        $chem = $cols[2];
        $year = $cols[4];
        $lbs  = $cols[5];
        $gms  = $cols[6];
    }
    else {
        $err = $csv->error_input;
        print "Failed to parse line: $err";
    }
}
close(RLS);

allow_loose_quotes did the trick with the embedded quotes. I also found something that said to change the escape_char so it's not the same as the quote_char, so I did that as well. And binary => 1 eliminated the need for undiacritic().
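For reference, here's a minimal, self-contained sketch of that combination; the sample line is invented, but it shows the kind of bare embedded quotes that allow_loose_quotes (paired with an escape_char different from the quote_char) tolerates:

use strict;
use warnings;
use Text::CSV;

# A made-up line with bare quotes inside a quoted field, which the
# default (strict) settings would reject with a parse error.
my $line = qq{TRI001,"emits "acid aerosols" yearly",1999};

my $csv = Text::CSV->new ({
    allow_loose_quotes => 1,
    escape_char        => "\\",    # must differ from quote_char (")
    binary             => 1,
});

$csv->parse ($line) or die "" . $csv->error_diag ();
print join (" | ", $csv->fields ()), "\n";
# TRI001 | emits "acid aerosols" yearly | 1999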

Now, instead of processing each record three times, I'm processing it once. With 1.7 million records, that is very nice.

Well, I thought I was done. It turns out that some of the fields (for lbs and gms) are "", some are 0.0, and some are something like 123.4 in the input file. I changed the assignment to this:

if    (!$cols[5])         { $lbs = 0 }
elsif ($cols[5] eq "0.0") { $lbs = 0 }
else                      { $lbs = $cols[5] }

if    (!$cols[6])         { $gms = 0 }
elsif ($cols[6] eq "0.0") { $gms = 0 }
else                      { $gms = $cols[6] }

and then this test

  if( !$lbs && !$gms )

gives valid results. Is there a better way to do this? It seems rather clunky. I'd have thought that 0.0 would be interpreted the same as 0, but apparently not.

Thanks so much for your help.

Re^4: problems parsing CSV
by Tux (Canon) on Oct 10, 2010 at 10:30 UTC

    Code like

    while (<RLS>) {
        $csv = Text::CSV->new ({
            allow_loose_quotes => 1,
            escape_char        => "\\",
            binary             => 1,
        }) or die "" . Text::CSV->error_diag ();
        next if ($. == 1);
        if ($csv->parse ($_)) {

    will very soon start failing when the fields have embedded newlines, as <RLS> reads physical lines and knows nothing about CSV quoting and escaping. When parsing CSV with Text::CSV or Text::CSV_XS, please use getline (). Also note that in your example you create a new csv parsing object for every line read. You should move the object outside the loop. If possible, choose encoding at the lowest possible level: at open () time.

    my $csv = Text::CSV_XS->new ({
        binary             => 1,
        escape_char        => "\\",
        allow_loose_quotes => 1,
        auto_diag          => 1,
        });
    open my $fh, "<:encoding(utf-8)", "text.csv" or die "text.csv: $!";
    $csv->getline ($fh);    # skip first line
    while (my $row = $csv->getline ($fh)) {
        my ($tri, $chem, $year, $lbs, $gms) =
            map { undiacritic ($_) } @{$row}[0,2,4,5,6];
        :
        :
        }
    close $fh;
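    To illustrate the embedded-newline point, a minimal sketch with an invented record: one CSV record that spans two physical lines, which <RLS> would hand you in broken halves but getline () returns whole.

    use strict;
    use warnings;
    use Text::CSV_XS;

    # One CSV record whose second field contains an embedded newline.
    my $data = qq{1,"acid\naerosols",1999\n};
    open my $fh, "<", \$data or die $!;

    # Reading with <$fh> would yield two incomplete "lines"; getline ()
    # honors the quoting and returns the full record as one row.
    my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
    my $row = $csv->getline ($fh);
    print scalar @{$row}, " fields\n";    # prints "3 fields"
    close $fh;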

    Enjoy, Have FUN! H.Merijn
Re^4: problems parsing CSV
by Jim (Curate) on Oct 10, 2010 at 23:46 UTC

    Once again, I strongly recommend you move the object constructor outside the while loop. Tux, a.k.a. H. Merijn Brand, the author of Text::CSV_XS, which is the base module you're indirectly using, amplified this suggestion in his reply to your post. He explained that "you create a new csv parsing object for every line read." You really don't want to do this needlessly for 1.7 million records. And Tux's advice to use getline instead of <>/parse/fields is a wise one.

    I think you're trying too hard to handle the floating-point weight values. Just be explicit with your tests:

    if ($pounds == 0.0 and $grams == 0.0) { ... }

    The right coercions happen automagically. Here's the proof:

    D:\>perl -E "say '' == 0.0 ? 'Equals' : 'Does not equal'"
    Equals

    D:\>perl -E "say '0' == 0.0 ? 'Equals' : 'Does not equal'"
    Equals

    D:\>perl -E "say '0.0' == 0.0 ? 'Equals' : 'Does not equal'"
    Equals

    D:\>perl -E "say '123' == 0.0 ? 'Equals' : 'Does not equal'"
    Does not equal

    D:\>perl -E "say '123.4' == 0.0 ? 'Equals' : 'Does not equal'"
    Does not equal

    D:\>
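    One caveat: under use warnings, the '' == 0.0 comparison emits an "Argument isn't numeric" warning, which is why it helps to wrap the test in a no warnings 'numeric' block (the follow-up script below uses one). A minimal sketch:

    use strict;
    use warnings;

    my $lbs = '';    # an empty weight field from the CSV
    {
        no warnings 'numeric';    # '' == 0.0 would otherwise warn
        print "weightless\n" if $lbs == 0.0;
    }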

    So you didn't like any of the other improvements I suggested in my refactored version of your script?

    Jim
Re^4: problems parsing CSV
by Jim (Curate) on Oct 11, 2010 at 05:51 UTC

    I've incorporated Tux's suggestion to use getline (getline_hr, actually) instead of <>/parse/fields. It really tightens up the whole script.

    #!/usr/bin/perl

    use strict;
    use warnings;
    use English qw( -no_match_vars );
    use Text::CSV;

    $OUTPUT_FIELD_SEPARATOR  = "\n";
    $OUTPUT_RECORD_SEPARATOR = "\n";

    my $release_file = '../ecodata/releases.txt';

    # Text is in the ISO 8859-1 (Latin 1) encoding
    open my $release_fh, '<:encoding(iso-8859-1)', $release_file
        or die "Can't open release file $release_file: $OS_ERROR\n";

    my $csv = Text::CSV->new({
        auto_diag          => 1,
        binary             => 1,
        allow_loose_quotes => 1,
        escape_char        => '\\',
    });

    # Header is 'TRI,Release#,ChemName,RegNum,Year,Pounds,Grams'
    $csv->column_names($csv->getline($release_fh));

    while (my $value = $csv->getline_hr($release_fh)) {
        {
            no warnings qw( numeric );
            if ($value->{'Pounds'} == 0.0 and $value->{'Grams'} == 0.0) {
                warn "Release $value->{'Release#'} is weightless\n";
            }
        }
        print $value->{'TRI'},
              $value->{'Release#'},
              $value->{'ChemName'},
              $value->{'RegNum'},
              $value->{'Year'},
              $value->{'Pounds'},
              $value->{'Grams'};
    }

    close $release_fh;

    exit 0;

      The bind_columns () method is actually faster. It matters when your streams are big.

      my $csv = Text::CSV->new ({
          auto_diag          => 1,
          binary             => 1,
          allow_loose_quotes => 1,
          escape_char        => "\\",
          });

      # Header is 'TRI,Release#,ChemName,RegNum,Year,Pounds,Grams'
      my %value;
      $csv->bind_columns (\@value{@{$csv->getline ($release_fh)}});

      while ($csv->getline ($release_fh)) {
          {
              no warnings "numeric";
              $value{Pounds} == 0.0 && $value{Grams} == 0.0 and
                  warn "Release $value{'Release#'} is weightless\n";
              }
          print $value{"TRI"},    $value{"Release#"}, $value{"ChemName"},
                $value{"RegNum"}, $value{"Year"},
                $value{"Pounds"}, $value{"Grams"};
          }

      YMMV; benchmark to check that it also holds for your set of data. My speed comparison looks like this. In that image, the lower the line, the faster, so Text::CSV_XS with bind_columns () (labeled "xs bndc") is the fastest at all sizes, and the pure-perl Text::CSV_PP counterpart with bind_columns () (labeled "pp bndc") is the slowest, as it has the most overhead in pure perl. If you only want to see the differences within the XS implementation, look at this graph.
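      If you want to reproduce such a comparison on your own file, here is a rough sketch using the core Benchmark module; the file name and the seven-column layout are assumptions taken from this thread, so adjust to taste:

      use strict;
      use warnings;
      use Benchmark qw( timethese );
      use Text::CSV_XS;

      my $file = "releases.txt";    # hypothetical test file

      timethese (3, {
          getline => sub {
              my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
              open my $fh, "<", $file or die "$file: $!";
              1 while $csv->getline ($fh);    # a fresh row ref each read
              close $fh;
              },
          bind_columns => sub {
              my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
              open my $fh, "<", $file or die "$file: $!";
              my @f;
              $csv->bind_columns (\(@f[0 .. 6]));    # 7 columns assumed
              1 while $csv->getline ($fh);    # fields land in @f directly
              close $fh;
              },
          });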

      Update 1: removed the erroneous call to column_names () as spotted by Jim.

      Update 2: New graphs: XS + PP and XS only


      Enjoy, Have FUN! H.Merijn

        Ok, here's the same script using bind_columns.

        #!/usr/bin/perl

        use strict;
        use warnings;
        use English qw( -no_match_vars );
        use Text::CSV;

        $OUTPUT_FIELD_SEPARATOR  = "\n";
        $OUTPUT_RECORD_SEPARATOR = "\n";

        my $release_file = '../ecodata/releases.txt';

        open my $release_fh, '<', $release_file
            or die "Can't open release file $release_file: $OS_ERROR\n";

        my $csv = Text::CSV->new({
            auto_diag          => 1,
            binary             => 1,
            allow_loose_quotes => 1,
            escape_char        => '\\',
        });

        my %value;

        # Header is 'TRI,Release#,ChemName,RegNum,Year,Pounds,Grams'
        my @column_labels = $csv->column_names($csv->getline($release_fh));
        $csv->bind_columns(\@value{@column_labels});

        while ($csv->getline_hr($release_fh)) {
            {
                no warnings 'numeric';
                if ($value{'Pounds'} == 0.0 and $value{'Grams'} == 0.0) {
                    warn "Release $value{'Release#'} is weightless\n";
                }
            }
            print $value{'TRI'},
                  $value{'Release#'},
                  $value{'ChemName'},
                  $value{'RegNum'},
                  $value{'Year'},
                  $value{'Pounds'},
                  $value{'Grams'};
        }

        close $release_fh;

        exit 0;

        I had to change your...

        \@value{@{$csv->column_names($csv->getline($release_fh))}}
        ...to...
        \@value{$csv->column_names($csv->getline($release_fh))}
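        The reason, as I understand it, is that column_names returns a plain list of the names rather than an array reference, so the extra @{ } dereference was wrong. A quick check with invented column names:

        use strict;
        use warnings;
        use Text::CSV;

        my $csv = Text::CSV->new({ binary => 1 });

        # column_names sets the names and returns them as a list
        my @names = $csv->column_names(qw( TRI Year Pounds ));
        print scalar @names, "\n";    # prints 3; a list, not a ref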