Given small CSV data files or big(ger) CSV data files with a filter so that all of the data fits into memory, the Text::CSV_XS' csv function will most likely accomodate the common usage:
use Text::CSV_XS qw( csv ); my $aoa = csv (in => "file.csv");
This function also supports the common attributes for new:
my $aoa = csv (in => "file.csv", sep_char => ";");
or even with shortcuts and aliasses:
my $aoa = csv (in => "file.csv", sep => ";");
If there is lots to process inside each row, not all rows would fit into memory, or the callback structure and options for csv will obscure the code, reverting to the low level interface is the only way to go:
use autodie; use Text::CSV_XS; my $csv = Text::CSV_XS->new ( binary => 1, auto_diag => 1, sep_char => ";", }); open my $fh, "<", "file.csv"; while (my $row = $csv->getline ($fh)) { # do something with the row } close $fh;
Quite often a CSV data source has got one header line that holds the column names, which is easy to ask for in the csv funtion:
# Default: return a list of lists (rows) my $aoa = csv (in => "file.csv"); # Using the header line: return a list of hashes (records) my $aoh = csv (in => "file.csv", headers => "auto");
Or in low-level
open my $fh, "<", "file.csv"; my @hdr = @{$csv->getline ($fh)}; $csv->column_names (@hdr); while (my $row = $csv->getline_hr ($fh)) { ...
This week I was confronted with a set of CSV files where the separator character was changing based on the content of the file. Oh, the horror! If the CSV file was expected to contain amounts, the program that did the export chose to use a ; separator and in other cases it used the default ,. IMHO the person that decided to do this should be fired without even blinking the eye.
This implied that on opening the CSV data stream, I - as a consumer - had to know in advance what this specific file would be like. Which made me come up with a new thought:
"If a CSV stream is supposed to have a header line that definess the column names, it is (very) unlikely that the column names will contain unpleasant characters like embedded newlines, semi-colons, or comma's. Remember, these are column names, not data rows. Not that it is prohibited to have header fields that have comma's or other non-word characters, but let us assume that it is uncommon enough to warrant support for easy of use."
So I wanted to convert this:
open my $fh, "<", "file.csv"; my @hdr = @{$csv->getline ($fh)}; $csv->column_names (@hdr); while (my $row = $csv->getline_hr ($fh)) {
where the $csv instance has to know what the separator is, to
open my $fh, "<", "file.csv"; my @hdr = $csv->header ($fh); $csv->column_names (@hdr); while (my $row = $csv->getline_hr ($fh)) {
which will do the same, but also detect and set the separator.
where the new header method will read the first line of the already opened stream, detect the separator based on a default list of allowed separators, use the detected sparator to set sep_char for given $csv instance and use it to parse the line and return the result as a list.
As this came to me as common practice, before you parse the rest of your CSV, I came up with a local method (not (yet) in Text::CSV_XS) that does this for me:
sub Text::CSV_XS::header { my ($csv, $fh, $seps) = @_; my $hdr = lc <$fh> or return; foreach my $sep (@{$seps || [ ";", "," ]}) { index $hdr, $sep < 0 and next; $csv->sep_char ($sep); last; } open my $h, "<", \$hdr; my $row = $csv->getline ($h); close $h; @{$row // []}; } # Text::CSV_XS::header
it even has some documentation :)
=head2 $csv->header ($fh) Return the CSV header and set C<sep_char>. my @hdr = $csv->header ($fh); my @hdr = $csv->header ($fh, [ ";", ",", "|", "\t" ]); Assuming that the file opened for parsing has a header, and the header does not contain problematic characters like embedded newlines, read the first line from the open handle, auto-detect whether the header separates the column names with a character from the allowed separator list. That list defaults to C<[ ";", "," ]> and can be overruled with an optional second argument. If any of the allowed separators matches (checks are done in order), set C<sep_char> to that sequence for the current CSV_XS instance and use it to parse the first line and return it as an array where all fields are mapped to lower case: my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 }); open my $fh, "<:encoding(iso-8859-1)", "file.csv"; my @hdr = $csv->header ($fh) or die "file.csv has no header line\n"; # $csv now has the correct sep_char while (my $row = $csv->getline ($fh)) { ... }
After two days of intensive use, I thought this might be useful to add to Text::CSV_XS so we all can profit, but I want to get it right from the start, so I ask for feedback (already got some from our local PM group)
Let the bikeshedding commence ...
Things I envision in this function is to also auto-detect encoding when the line includes a BOM and set it to the stream using binmode or have some option to allow this new method to not only return the headers, but use them to set the column names:
#--- my $data = "foo,bar\r\n1,baz\r\n"; open my $fh, "<", \$data; my @hdr = $csv->header ($fh); # ("foo", "bar") while (my $row = $csv->getline ($fh)) { # $row = [ "1", "baz" ] #--- my $data = "foo;bar\r\n1;baz\r\n"; open my $fh, "<", \$data; my @hdr = $csv->header ($fh); # ("foo", "bar") $csv->column_names (@hdr); while (my $row = $csv->getline_hr ($fh)) { # $row = { foo => "1", bar => "baz" } #--- my $data = "foo|bar\r\n1|baz\r\n"; open my $fh, "<", \$data; $csv->column_names ($csv->header ($fh, [ ";", ",", "|" ])); while (my $row = $csv->getline_hr ($fh)) { # $row = { foo => "1", bar => "baz" }
|
---|
Replies are listed 'Best First'. | |
---|---|
Re: CSV headers. Feedback wanted
by duelafn (Parson) on Feb 10, 2016 at 14:47 UTC | |
by Tux (Canon) on Feb 10, 2016 at 15:33 UTC | |
Re: CSV headers. Feedback wanted
by bitingduck (Deacon) on Feb 11, 2016 at 06:49 UTC | |
by Tux (Canon) on Feb 11, 2016 at 07:22 UTC | |
by bitingduck (Deacon) on Feb 11, 2016 at 14:57 UTC | |
by RonW (Parson) on Feb 12, 2016 at 23:57 UTC | |
by bitingduck (Deacon) on Feb 14, 2016 at 16:18 UTC | |
Re: CSV headers. Feedback wanted
by Tux (Canon) on Feb 16, 2016 at 15:40 UTC | |
Re: CSV headers. Feedback wanted
by Laurent_R (Canon) on Feb 13, 2016 at 19:00 UTC | |
by Tux (Canon) on Feb 13, 2016 at 20:58 UTC | |
by Laurent_R (Canon) on Feb 14, 2016 at 09:53 UTC | |
by Anonymous Monk on Feb 13, 2016 at 20:53 UTC | |
by Anonymous Monk on Feb 13, 2016 at 20:52 UTC |