I've incorporated Tux's suggestion to use getline (getline_hr, actually) instead of <>/parse/fields. It really tightens up the whole script.

#!/usr/bin/perl

use strict;
use warnings;

use English qw( -no_match_vars );
use Text::CSV;

$OUTPUT_FIELD_SEPARATOR  = "\n";
$OUTPUT_RECORD_SEPARATOR = "\n";

my $release_file = '../ecodata/releases.txt';

# Text is in the ISO 8859-1 (Latin 1) encoding
open my $release_fh, '<:encoding(iso-8859-1)', $release_file
    or die "Can't open release file $release_file: $OS_ERROR\n";

my $csv = Text::CSV->new({
    auto_diag          => 1,
    binary             => 1,
    allow_loose_quotes => 1,
    escape_char        => '\\',
});

# Header is 'TRI,Release#,ChemName,RegNum,Year,Pounds,Grams'
$csv->column_names($csv->getline($release_fh));

while (my $value = $csv->getline_hr($release_fh)) {
    {
        no warnings qw( numeric );
        if ($value->{'Pounds'} == 0.0 and $value->{'Grams'} == 0.0) {
            warn "Release $value->{'Release#'} is weightless\n";
        }
    }

    print
        $value->{'TRI'},
        $value->{'Release#'},
        $value->{'ChemName'},
        $value->{'RegNum'},
        $value->{'Year'},
        $value->{'Pounds'},
        $value->{'Grams'};
}

close $release_fh;

exit 0;

Re^5: problems parsing CSV
by Tux (Canon) on Oct 11, 2010 at 06:40 UTC

    The bind_columns () method is actually faster, which matters when your streams are big.

    my $csv = Text::CSV->new ({
        auto_diag          => 1,
        binary             => 1,
        allow_loose_quotes => 1,
        escape_char        => "\\",
    });

    # Header is 'TRI,Release#,ChemName,RegNum,Year,Pounds,Grams'
    my %value;
    $csv->bind_columns (\@value{@{$csv->getline ($release_fh)}});

    # With bound columns, read with getline (): each field lands in %value
    while ($csv->getline ($release_fh)) {
        {
            no warnings "numeric";
            $value{Pounds} == 0.0 && $value{Grams} == 0.0 and
                warn "Release $value{'Release#'} is weightless\n";
        }

        print
            $value{"TRI"},
            $value{"Release#"},
            $value{"ChemName"},
            $value{"RegNum"},
            $value{"Year"},
            $value{"Pounds"},
            $value{"Grams"};
    }

    YMMV; benchmark to check whether it also holds for your set of data. My speed comparison looks like this. In that image, the lower the line, the faster, so Text::CSV_XS with bind_columns () (labeled "xs bndc") is the fastest at all sizes, and the pure-Perl Text::CSV_PP counterpart with bind_columns () (labeled "pp bndc") is the slowest, as it has the most overhead in pure Perl. If you only want to see the differences within the XS implementation, look at this graph.
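
    For anyone who wants to run that comparison on their own data, here is a minimal sketch using the core Benchmark module. The synthetic rows and the sub names (read_with_getline_hr, read_with_bind_columns) are made up for illustration; substitute your real file, and expect the numbers to depend on which Text::CSV backend (XS or PP) is installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Benchmark qw( cmpthese );
use Text::CSV;

# Hypothetical synthetic data: the thread's header plus 5000 made-up rows.
my $data = "TRI,Release#,ChemName,RegNum,Year,Pounds,Grams\n"
    . join '', map { "T$_,$_,Chem$_,$_-00-0,2010,$_.5,0\n" } 1 .. 5_000;

# Approach 1: column_names + getline_hr, a fresh hashref per row.
sub read_with_getline_hr {
    open my $fh, '<', \$data or die "Can't open in-memory data: $!\n";
    my $csv = Text::CSV->new({ auto_diag => 1, binary => 1 });
    $csv->column_names($csv->getline($fh));
    my $rows = 0;
    while (my $row = $csv->getline_hr($fh)) {
        $rows++ if defined $row->{'Pounds'};
    }
    close $fh;
    return $rows;
}

# Approach 2: bind_columns + getline, fields land in a prebound hash.
sub read_with_bind_columns {
    open my $fh, '<', \$data or die "Can't open in-memory data: $!\n";
    my $csv = Text::CSV->new({ auto_diag => 1, binary => 1 });
    my %value;
    $csv->bind_columns(\@value{ @{ $csv->getline($fh) } });
    my $rows = 0;
    while ($csv->getline($fh)) {
        $rows++ if defined $value{'Pounds'};
    }
    close $fh;
    return $rows;
}

# Run each approach for at least one CPU second and print a comparison table.
cmpthese(-1, {
    getline_hr   => \&read_with_getline_hr,
    bind_columns => \&read_with_bind_columns,
});
```

    Both subs return the number of data rows read, so you can confirm they process the same input before trusting the timings.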

    Update 1: removed the erroneous call to column_names () as spotted by jim.

    Update 2: New graphs: XS + PP and XS only


    Enjoy, Have FUN! H.Merijn

      Ok, here's the same script using bind_columns.

      #!/usr/bin/perl

      use strict;
      use warnings;

      use English qw( -no_match_vars );
      use Text::CSV;

      $OUTPUT_FIELD_SEPARATOR  = "\n";
      $OUTPUT_RECORD_SEPARATOR = "\n";

      my $release_file = '../ecodata/releases.txt';

      open my $release_fh, '<', $release_file
          or die "Can't open release file $release_file: $OS_ERROR\n";

      my $csv = Text::CSV->new({
          auto_diag          => 1,
          binary             => 1,
          allow_loose_quotes => 1,
          escape_char        => '\\',
      });

      my %value;

      # Header is 'TRI,Release#,ChemName,RegNum,Year,Pounds,Grams'
      my @column_labels = $csv->column_names($csv->getline($release_fh));
      $csv->bind_columns(\@value{@column_labels});

      while ($csv->getline_hr($release_fh)) {
          {
              no warnings 'numeric';
              if ($value{'Pounds'} == 0.0 and $value{'Grams'} == 0.0) {
                  warn "Release $value{'Release#'} is weightless\n";
              }
          }

          print
              $value{'TRI'},
              $value{'Release#'},
              $value{'ChemName'},
              $value{'RegNum'},
              $value{'Year'},
              $value{'Pounds'},
              $value{'Grams'};
      }

      close $release_fh;

      exit 0;

      I had to change your...

      \@value{@{$csv->column_names($csv->getline($release_fh))}}
      ...to...
      \@value{$csv->column_names($csv->getline($release_fh))}
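
      The reason for the change, as far as I can tell: getline returns an array reference, which has to be dereferenced with @{...} before it can index a hash slice, while column_names returns a plain list, which must not be. A rough core-Perl sketch of the difference follows; labels_as_ref and labels_as_list are hypothetical stand-ins for the two methods, not part of Text::CSV.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-ins: one returns an array reference (as getline does),
# the other returns a plain list (as column_names does).
sub labels_as_ref  { return [ qw( TRI Pounds Grams ) ] }
sub labels_as_list { return qw( TRI Pounds Grams ) }

my (%from_ref, %from_list);

# An array reference must be dereferenced before it can index a hash slice.
my @refs_a = \@from_ref{ @{ labels_as_ref() } };

# A plain list indexes the slice directly; wrapping it in @{...} would fail.
my @refs_b = \@from_list{ labels_as_list() };

# Backslash on a hash slice yields one scalar reference per element, which
# is the kind of list bind_columns expects. Writing through a reference
# fills the corresponding hash slot.
${ $refs_a[1] } = 12.5;    # second label: Pounds
${ $refs_b[2] } = 0;       # third label: Grams

print "from_ref:  Pounds = $from_ref{'Pounds'}\n";
print "from_list: Grams  = $from_list{'Grams'}\n";
```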

        You are now mixing two approaches. When using bind_columns () you should not use getline_hr () but getline (), because you are not returning a hashref but reading into prebound variables:

        my @column_labels = @{$csv->getline ($release_fh)};
        $csv->bind_columns (\@value{@column_labels});
        while ($csv->getline ($release_fh)) {
            :
        }
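
        Filled out into something self-contained, that pattern might look like the sketch below. The three-line sample data and the @weightless array are invented for illustration and stand in for releases.txt; the Text::CSV calls are the same ones used above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Text::CSV;

# Hypothetical sample data standing in for releases.txt.
my $data = <<'END_CSV';
TRI,Release#,ChemName,RegNum,Year,Pounds,Grams
T1,1,Benzene,71-43-2,2008,12.5,0
T2,2,Toluene,108-88-3,2009,0,0
END_CSV

# Read from an in-memory filehandle so the sketch needs no external file.
open my $fh, '<', \$data or die "Can't open in-memory data: $!\n";

my $csv = Text::CSV->new({ auto_diag => 1, binary => 1 });

# The header row supplies the hash keys; bind one scalar ref per column.
my %value;
my @column_labels = @{ $csv->getline($fh) };
$csv->bind_columns(\@value{@column_labels});

# getline now stores each row's fields directly in %value.
my @weightless;
while ($csv->getline($fh)) {
    push @weightless, $value{'Release#'}
        if $value{'Pounds'} == 0 and $value{'Grams'} == 0;
}

close $fh;

print "Weightless releases: @weightless\n";
```

        Note that %value is overwritten on every row, so copy anything you need to keep beyond the current iteration.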

        You do not use the method column_names () at all. That was a cut-n-paste error from your code in my previous example. Mea culpa.

        \@value{@{$csv->column_names ($csv->getline ($release_fh))}}
        =>
        \@value{@{$csv->getline ($release_fh)}};

        Enjoy, Have FUN! H.Merijn