Given small CSV data files, or big(ger) CSV data files combined with a filter so that all of the data fits into memory, the csv function of Text::CSV_XS will most likely accommodate the common usage:

use Text::CSV_XS qw( csv );
my $aoa = csv (in => "file.csv");

This function also supports the common attributes for new:

my $aoa = csv (in => "file.csv", sep_char => ";");

or even with shortcuts and aliases:

my $aoa = csv (in => "file.csv", sep => ";");

If there is a lot to process inside each row, not all rows fit into memory, or the callback structure and options for csv would obscure the code, reverting to the low-level interface is the only way to go:

use autodie;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ({
    binary    => 1,
    auto_diag => 1,
    sep_char  => ";",
    });
open my $fh, "<", "file.csv";
while (my $row = $csv->getline ($fh)) {
    # do something with the row
    }
close $fh;

Quite often a CSV data source has a header line that holds the column names, which is easy to ask for in the csv function:

# Default: return a list of lists (rows)
my $aoa = csv (in => "file.csv");

# Using the header line: return a list of hashes (records)
my $aoh = csv (in => "file.csv", headers => "auto");

Or with the low-level interface:

open my $fh, "<", "file.csv";
my @hdr = @{$csv->getline ($fh)};
$csv->column_names (@hdr);
while (my $row = $csv->getline_hr ($fh)) {
    ...

This week I was confronted with a set of CSV files where the separator character changed based on the content of the file. Oh, the horror! If the CSV file was expected to contain amounts, the program that did the export chose to use a ; separator, and in other cases it used the default ,. IMHO the person who decided to do this should be fired without even blinking an eye.

This implied that on opening the CSV data stream, I - as a consumer - had to know in advance what this specific file would be like. Which made me come up with a new thought:

"If a CSV stream is supposed to have a header line that definess the column names, it is (very) unlikely that the column names will contain unpleasant characters like embedded newlines, semi-colons, or comma's. Remember, these are column names, not data rows. Not that it is prohibited to have header fields that have comma's or other non-word characters, but let us assume that it is uncommon enough to warrant support for easy of use."

So I wanted to convert this:

open my $fh, "<", "file.csv";
my @hdr = @{$csv->getline ($fh)};
$csv->column_names (@hdr);
while (my $row = $csv->getline_hr ($fh)) {

where the $csv instance has to know what the separator is, to

open my $fh, "<", "file.csv";
my @hdr = $csv->header ($fh);
$csv->column_names (@hdr);
while (my $row = $csv->getline_hr ($fh)) {

which will do the same, but also detect and set the separator.

The new header method will read the first line of the already opened stream, detect the separator based on a default list of allowed separators, use the detected separator to set sep_char for the given $csv instance, and then use it to parse the line and return the result as a list.

As this seems like common practice (detect the header before you parse the rest of your CSV), I came up with a local method (not (yet) in Text::CSV_XS) that does this for me:

sub Text::CSV_XS::header {
    my ($csv, $fh, $seps) = @_;
    my $hdr = lc <$fh> or return;
    foreach my $sep (@{$seps || [ ";", "," ]}) {
        index ($hdr, $sep) < 0 and next;
        $csv->sep_char ($sep);
        last;
        }
    open my $h, "<", \$hdr;
    my $row = $csv->getline ($h);
    close $h;
    @{$row // []};
    } # Text::CSV_XS::header

it even has some documentation :)

=head2 $csv->header ($fh)

Return the CSV header and set C<sep_char>.

 my @hdr = $csv->header ($fh);
 my @hdr = $csv->header ($fh, [ ";", ",", "|", "\t" ]);

Assuming that the file opened for parsing has a header, and the header does not contain problematic characters like embedded newlines, read the first line from the open handle, auto-detect whether the header separates the column names with a character from the allowed separator list. That list defaults to C<[ ";", "," ]> and can be overruled with an optional second argument. If any of the allowed separators matches (checks are done in order), set C<sep_char> to that sequence for the current CSV_XS instance and use it to parse the first line and return it as an array where all fields are mapped to lower case:

 my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
 open my $fh, "<:encoding(iso-8859-1)", "file.csv";
 my @hdr = $csv->header ($fh)
     or die "file.csv has no header line\n";
 # $csv now has the correct sep_char
 while (my $row = $csv->getline ($fh)) {
     ...
     }

After two days of intensive use, I thought this might be useful to add to Text::CSV_XS so we can all profit, but I want to get it right from the start, so I am asking for feedback (I already got some from our local PM group).

Let the bikeshedding commence ...

Things I envision for this function: also auto-detect the encoding when the line includes a BOM and set it on the stream using binmode, or have an option to let this new method not only return the headers, but also use them to set the column names:

#---
my $data = "foo,bar\r\n1,baz\r\n";
open my $fh, "<", \$data;
my @hdr = $csv->header ($fh);    # ("foo", "bar")
while (my $row = $csv->getline ($fh)) {
    # $row = [ "1", "baz" ]

#---
my $data = "foo;bar\r\n1;baz\r\n";
open my $fh, "<", \$data;
my @hdr = $csv->header ($fh);    # ("foo", "bar")
$csv->column_names (@hdr);
while (my $row = $csv->getline_hr ($fh)) {
    # $row = { foo => "1", bar => "baz" }

#---
my $data = "foo|bar\r\n1|baz\r\n";
open my $fh, "<", \$data;
$csv->column_names ($csv->header ($fh, [ ";", ",", "|" ]));
while (my $row = $csv->getline_hr ($fh)) {
    # $row = { foo => "1", bar => "baz" }

Enjoy, Have FUN! H.Merijn

Replies are listed 'Best First'.
Re: CSV headers. Feedback wanted
by duelafn (Parson) on Feb 10, 2016 at 14:47 UTC

    I like the idea. My comments:

    • Take separators as a key/value pair so that as other auto-detectable features are added they can be configured as well.
    • The auto-detectable features should have reasonable defaults (you already have default separators, so this just formalizes what I agree it should be, but do add tab to the default separators).
    • The auto-detectable features should be disableable (perhaps via something like no_eol => 1).
    • Don't force lower-case.
    • I'd suggest a more obnoxious name so that it is clear that it is a magical routine. Perhaps autoconfigure_from_header.

    A quick skim of the documented options suggests the following candidates for auto-detection: BOM (encoding), eol, quote, escape_char, allow_whitespace (set only), allow_loose_quote (set only), allow_loose_escapes (set only), quote_empty. Several of those will be undeterminable from just the header (in which case should remain as default), but if detected should set the parameter.

    Good Day,
        Dean

      • key/value pairs: I think I agree, but having the most likely usage work as suggested fits better with the other methods. Nothing is carved in stone yet, though
      • defaults. Once the API stabilizes, I'll add them more clearly
      • auto-detect. Good thought. All auto-detection on, and disable with key/value options. I like that
      • lower case will be the default, but casing can be another option none/lc/uc/tc
      • auto_header? autoconfigure_from_header seems like setting too high an expectation level
      • BOM: yes, definitely
      • eol: no. though possible, it is useless, as readline will only read a line that already is covered by the default eol setting (undef)
      • quote: unreliable. will cause surprises
      • escape: unreliable. will cause surprises
      • allow_whitespace: possibly harmful. If the header contains spaces around the separators, it is likely the data will also have it, but that is not vice-versa.
      • allow_loose_quote: no (see allow_whitespace)
      • allow_loose_escapes: no (see allow_whitespace)
      • quote_empty: no, it is very unlikely a header contains empty fields. If it does, using this new functionality will only cause more trouble than it is worth

      Enjoy, Have FUN! H.Merijn
Re: CSV headers. Feedback wanted
by bitingduck (Deacon) on Feb 11, 2016 at 06:49 UTC

    I like it, but I will give an example of a CSV format I'm using that is pathological and will give the wrong separator; it's a special case, though, and the number of "customers" for it is very restricted.

    I have a data acquisition system made by someone else that reads a few hundred parameters every 20 seconds or so. It writes them to disk locally and spits them out via UDP to a closed network (because of some firewall requirements).

    The data format for each line is:

    val1,val2, val3,...valN,;,header1, header2, header3,...,headerN <CRLF>

    but the first line of any given file is:

    header1, header2, header3,...,headerN,;,header1, header2, header3,...,headerN <CRLF>

    where the semicolon is in there to let the reader know that it's at the end of the value list, and the rest of the line is header information.

    The reason for the first line is that the reader of the file is likely going to load it into Excel, and putting a header at the top reduces the risk of column alignment errors from copying the header information to the top of the Excel file. The reasoning behind the line format is that the listener on the UDP port (which is written in Perl, makes a backup copy of the data, and serves it up to both a web page and a labview program that displays data plots) knows nothing about the details of the sender, and if the sender software is updated to add data columns, the listener has to notice and still properly display and label them. So if the listener sees a change in the header list, it will start a new file with the correct header at the top of it.

    Your autodetect looks like it would catch the semicolon and quit looking for separators. I admit that this is a pathological case and you're unlikely to see it in the wild, but oddball cases with more than one potential separator in the header line do exist. You might want to have an option to throw an error if more than one potential separator appears in the first line, and/or maybe an option to count the number of each type of potential separator and pick the one with the most instances. I think I'd rather see it default to an error message if there are multiple potential separators in the header, and have a flag that lets me tell it to suppress the error and guess.
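    A minimal sketch of that counting-plus-error strategy (a hypothetical guess_sep helper, not part of Text::CSV_XS) could look like:

```perl
# Hypothetical helper: count each candidate separator in the raw header
# line; in strict mode die as soon as more than one candidate is present,
# otherwise pick the most frequent one.
sub guess_sep {
    my ($hdr, $seps, $strict) = @_;
    my %count = map { $_ => scalar (() = $hdr =~ m/\Q$_\E/g) }
                @{$seps || [ ";", "," ]};
    my @present = grep { $count{$_} }
                  sort { $count{$b} <=> $count{$a} } keys %count;
    @present > 1 && $strict and die "Multiple possible separators in header\n";
    return $present[0];    # undef when none of the candidates matched
    }
```

    With the pathological line above, non-strict mode would pick the comma (most frequent), while strict mode would refuse to guess.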

      As you state yourself already, this is no standard CSV, and no auto-detection would work. The new header method is intended for the majority of CSV data files that contain a sane header.

      As this will be a new auxiliary / helper method that is not integrated in the normal flow, it is completely optional. Let's say it is just as optional as fragment or callbacks are, which are also there just to ease attacking specific problems in CSV parsing.

      Your example is clear, and from real-life data; not just some data invented to point at the weaknesses of new functionality. But with this data, it is obvious that this new method is not something that would help you at all; it was not created for cases like this.

      In your case fragment could help a lot: read the first line, detect the headers and the header count, and then use fragment to read only the data part when reading the rest of the data.
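      A sketch of that fragment-based approach, assuming the lone ";" marker field from the example and the RFC 7111-style fragment method of Text::CSV_XS (untested against this exact format):

```perl
use Text::CSV_XS;

# Hypothetical helper for the format above: first line is the header list,
# a lone ";" marker field, then the header list again. Detect the marker,
# keep the headers before it, and use fragment to select only the value
# columns from the remaining rows.
sub read_values {
    my $fh  = shift;
    my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
    my @line = @{$csv->getline ($fh)};
    my ($mark) = grep { $line[$_] eq ";" } 0 .. $#line;  # index of ";" field
    my @hdr  = @line[0 .. $mark - 1];
    my $data = $csv->fragment ($fh, "col=1-$mark");      # value columns only
    return (\@hdr, $data);
    }
```

      Usage would be something like: open the file, call read_values, and get back the header list plus an array of data rows without the duplicated header columns.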


      Enjoy, Have FUN! H.Merijn
        It is legal CSV - the only thing goofy about it is that it has a semicolon in one of the places between commas, rather than useful header or data information. If I'm blindly using your new approach to read CSV files because, for example, I'm an unwitting recipient of the file, or because (more likely) I've forgotten about the format, I'd like the autodetect to warn me "Hey, this isn't quite standard CSV". I don't have a problem reading the data - we just read a whole line, use a regex to split it at the semicolon and away we go. It's intended as an example of something that will trip up a user of your autodetect. The lines in the file are long enough that a lazy user would look at the upper left and say "looks like CSV, I'll apply my generic CSV reader that uses Tux's nifty autodetect".

        First, I agree that guessing separators in a weird situation is too risky. Also, I think that detecting the presence of multiple possible separators and throwing an error makes sense.

        Yes, I know that the data resulting from choosing the wrong separator should look weird, but a person might not realize what they are seeing. Throwing an error when an easily detectable "weirdness" is seen will help alert the user to the situation.

Re: CSV headers. Feedback wanted
by Tux (Canon) on Feb 16, 2016 at 15:40 UTC

    All feedback weighed, I have now committed and pushed:


    header

    This method does NOT work in perl-5.6.x

    Parse the CSV header and set sep_char and encoding.

    my @hdr = $csv->header ($fh)->column_names;
    $csv->header ($fh, [ ";", ",", "|", "\t" ]);
    $csv->header ($fh, { bom => 1, fold => "lc" });
    $csv->header ($fh, [ ",", ";" ], { bom => 1, fold => "lc" });

    The first argument should be a file handle.

    Assuming that the file opened for parsing has a header, and the header does not contain problematic characters like embedded newlines, read the first line from the open handle, auto-detect whether the header separates the column names with a character from the allowed separator list. That list defaults to [ ";", "," ] and can be overruled with an optional argument of an anonymous list of allowed separator sequences. If any of the allowed separators matches, and none of the other allowed separators match, set sep_char to that sequence for the current CSV_XS instance and use it to parse the first line, map those to lowercase, use that to set the instance column_names and return the instance:

    my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
    open my $fh, "<:encoding(iso-8859-1)", "file.csv";
    $csv->header ($fh);
    while (my $row = $csv->getline_hr ($fh)) {
        ...
        }

    If the header is empty, contains more than one unique separator out of the allowed set, contains empty fields, or contains identical fields (after folding), it will croak with error 1010, 1011, 1012, or 1013 respectively.

    This method will return the instance on success or undefined on failure if it did not croak.

    Options

    bom
     $csv->header ($fh, { bom => 1 });

    The default behavior is to detect if the header line starts with a BOM. If the header has a BOM, use that to set the encoding of $fh. This default behavior can be disabled by passing a false value to the bom option.

    Supported encodings from BOM are: UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE, UTF-1, UTF-EBCDIC, SCSU, BOCU-1, and GB-18030. UTF-7 is not supported.
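    As an illustration only (the real implementation lives in the module and handles more cases), stripping a BOM from an already-read raw header line might look like this hypothetical strip_bom helper:

```perl
# Hypothetical helper: detect a leading BOM in a raw (byte) header line,
# strip it, and return the matching encoding name plus the rest of the line.
# Longest sequences are tried first, because the UTF-16LE BOM is a prefix
# of the UTF-32LE BOM.
sub strip_bom {
    my $hdr = shift;
    my %bom = (
        "\xff\xfe\x00\x00" => "utf-32le",
        "\x00\x00\xfe\xff" => "utf-32be",
        "\xef\xbb\xbf"     => "utf-8",
        "\xff\xfe"         => "utf-16le",
        "\xfe\xff"         => "utf-16be",
        );
    foreach my $seq (sort { length $b <=> length $a } keys %bom) {
        substr ($hdr, 0, length $seq) eq $seq or next;
        return ($bom{$seq}, substr ($hdr, length $seq));
        }
    return (undef, $hdr);    # no BOM found
    }
```

    The caller could then apply the detected encoding to the rest of the stream with binmode ($fh, ":encoding($enc)").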

    This is work in progress; currently only UTF-8 works as expected.

    fold
     $csv->header ($fh, { fold => "lc" });

    The default is to fold the header to lower case. You can also choose to fold the headers to upper case with { fold => "uc" } or to leave the fields as-is with { fold => "none" }.

    columns
     $csv->header ($fh, { columns => 1 });

    The default is to set the instance's column names using column_names if the method is successful, so subsequent calls to getline_hr can return a hash. Setting the column names can be disabled by passing a false value for this option, like { columns => 0 }.


    Enjoy, Have FUN! H.Merijn
Re: CSV headers. Feedback wanted
by Laurent_R (Canon) on Feb 13, 2016 at 19:00 UTC
    Quite an interesting idea. I think it makes sense.

    Just a short slightly off-topic comment. I am almost never using Text::CSV_XS for one single reason: in 95% of the cases where I would have a use for it, my data files are just too large to fit into memory. And even when they might fit, I can't take the risks of having a program failure because one day the input file is larger than expected. What I would really wish is an iterator version of Text::CSV_XS.

    Our CSV files are simple, I can work around my large file problem by just reading files line by line and using split, but I think it is a pity that Text::CSV_XS does not offer an iterator mode.

    Sorry for being off-topic, but I thought it was worth mentioning this problem with Text::CSV_XS. (And, BTW, I do not feel at this point like starting to write an XS module as an alternative.)

      The default low-level mode is iterative!

      The new csv function by default is pulling all data into memory, but by using a filter you can prevent that. The documentation shows examples.

      Iterative (straight from the docs):

      use Text::CSV_XS;
      my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
      open my $fh, "<", "file.csv" or die "file.csv: $!";
      while (my $row = $csv->getline ($fh)) {
          # do something with @$row
          }
      close $fh or die "file.csv: $!";

      Enjoy, Have FUN! H.Merijn
        Really? Strange, I can swear that I tried it for at least a full day and it failed miserably each time the file was really large. I guess I must have used it wrongly then, possibly I used an example that was not adequate for my needs, but it's too long ago for me to remember the exact details.

        Well, thank you (and thanks to AnonyMonk) for the information; I'll look at it again. And... sorry for the disruption with apparently inaccurate information.

      what? The module is all about iterators, only sub csv slurps