There are at least six CPAN modules that handle "text delimited formats", more properly called "text separated formats", such as CSV (Comma Separated Values). The chart below attempts to compare their parsing properties: how they handle and define the data format options. The chart does *not* show other differences between the modules (e.g. that Text::xSV has specialized formatting and printing routines, that Text::CSV_XS has routines related to data typing, that the DBDs support DBI/SQL access to the data formats, etc.).
This comparison covers
Text::CSV
Text::xSV
Text::CSV_XS
DBD::CSV
DBD::AnyData
AnyData
I am the maintainer of the last three modules. If I've inadvertently misrepresented any of the modules, it's out of ignorance; please correct me. My congrats to Tilly, Alan Citterman, and Jochen Wiedmann, authors of the other excellent modules on the list.
Comma Separated Values is not a single standard; it refers to a number of slightly different ways to represent data. There is no "Correct CSV", only CSV that is correct according to the rules of a particular CSV style. "Classic" CSV, or the kind that many people think of when they talk about CSV, is a set of records separated by newlines, with the fields of the records separated by commas, the contents of the fields (in some cases) delimited with double quote marks, and a doubled double-quote as an escape character within fields. But there is, AFAIK, no ISO or ANSI or other international standard defining this "classic" CSV as the one true CSV. All of the CPAN modules which handle CSV formats allow redefinition of the separator character, so the format is really *SV: it includes "tab delimited" and "pipe delimited" formats, which simply use tabs or pipes in the place where CSV uses commas.
These words form a comma-SEPARATED, period-TERMINATED record with
four quote-DELIMITED fields.
"Just","Another","CSV","Hacker".
The parsing properties compared in the chart are:
- what goes between fields: a comma in classic CSV, but e.g. a tab or pipe in "tab delimited" or "pipe delimited" formats
- what goes around fields: a pair of double quotes in classic CSV, but some modules allow it to be redefined
- the character used to escape the delimiter when it occurs embedded in a field: a double quote in classic CSV (e.g. "this, ""is"" one field"), but some modules allow it to be redefined (e.g. to a backslash)
- what goes between records: a newline in classic CSV, but some modules allow it to be redefined; this can be critical if you are mixing CSV files created on different operating systems without using something like dos2unix to convert them, since the newline differs between OSs; alternate record separators also allow data in "vertical" formats, e.g. where a newline is a field separator and a double newline is a record separator
- the ability to use the newline character inside a field: obviously critical if your data has newlines; then again, rejecting them is sometimes the desired behaviour, e.g. if you are prepping data for another program which won't accept embedded newlines
- the ability to use binary data (e.g. NUL chars or ^L) embedded in fields: again, rejecting it is sometimes the desired behaviour; if you are prepping for a program that won't accept binary data, you want the parser to fail on parsing it
- sparse vs. forced delimiting: classic CSV uses sparse delimiting, putting delimiters only around fields that need them, e.g. those with embedded commas, newlines, or quotes; with sparse delimiting this is a valid 3-field record: foo,"bar,bop",7; but some CSV styles always use delimiters for all fields, so some modules support forcing delimiters onto all fields or onto all non-numeric fields
- null handling: Text::xSV differentiates between null (undefined) values and empty strings; the other modules treat them the same
- pure perl vs. C/XS: some of the modules are pure perl and can therefore be installed without compilation, while others have C/XS components and require a compilation on a specific platform; the C/XS modules are generally faster than the pure perl modules
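To make these knobs concrete, here is how they map onto Text::CSV_XS constructor options (a sketch; the option names come from the Text::CSV_XS docs, and the particular values are just illustrations):

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Each option corresponds to one of the properties discussed above.
my $csv = Text::CSV_XS->new({
    sep_char     => "\t",   # user-defined field separator (tab-delimited)
    quote_char   => '"',    # user-defined delimiter around fields
    escape_char  => '"',    # user-defined escape for embedded delimiters
    eol          => "\r\n", # user-defined record separator
    binary       => 1,      # accept embedded newlines and binary data
    always_quote => 1,      # forced delimiting of every field
});

$csv->combine("a", "b\nc", 7) or die "combine() failed";
print $csv->string;   # every field quoted, tab-separated
```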
A plus mark indicates the presence of a feabugir (feature or bug or irrelevancy, depending on the context), not necessarily that it is "better" than a minus mark.
feature | Text::CSV | Text::xSV | Text::CSV_XS | DBD::CSV | AnyData |
accept newlines | - | + | + | + | * |
reject newlines | + | - | + | - | + |
accept embedded binary | - | + | + | + | + |
reject embedded binary | + | - | + | - | - |
forced delimiting | + | - | + | - | - |
sparse delimiting | - | + | + | + | + |
user-defined field sep | - | + | + | + | + |
user-defined delimiter | - | - | + | + | + |
user-defined escape | - | - | + | + | + |
user-defined record sep | - | - | + | + | + |
pure perl | + | + | - | - | + |
null handling | - | + | - | - | - |
Notes: Some of the modules accept flags which can change their default behaviour; e.g. Text::CSV_XS defaults to rejecting newlines but can easily be set to accept them by passing the "binary" flag. In these cases, they are shown with plus marks for all possible settings.
DBD::AnyData has the same properties as AnyData (which is a multi-level tied-hash interface to the data); both accept embedded newlines only if something other than newline is used as the record separator, hence the "*" in the chart.
DBD::CSV is actually built on top of Text::CSV_XS but since it
uses specific flags for Text::CSV_XS, its parsing properties
are somewhat different.
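For example, here is how the "binary" flag flips Text::CSV_XS from rejecting to accepting an embedded newline (a quick sketch):

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $line = qq{"one","two\nlines","three"};

# Default settings: the embedded newline makes parse() fail
my $strict = Text::CSV_XS->new;
print $strict->parse($line) ? "parsed\n" : "rejected\n";   # rejected

# With binary => 1, the very same line parses cleanly
my $lax = Text::CSV_XS->new({ binary => 1 });
print $lax->parse($line) ? "parsed\n" : "rejected\n";      # parsed
my @fields = $lax->fields;   # ("one", "two\nlines", "three")
```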
Update: added readmore tags.
Update 2: added null handling.
Re: Comparison of the parsing features of CSV (and xSV) modules
by dragonchild (Archbishop) on Jun 14, 2004 at 16:41 UTC
You missed a few features of Text::xSV
It wasn't my intention to make a full comparison of the features, only of the CSV parsing. Like Text::xSV, AnyData and the DBDs also have many options for handling the resulting data structures; it would take a much lengthier meditation to compare all of those.
Adding a few features shouldn't be difficult.
Those features may or may not be worthwhile in Text::xSV. I didn't intend the minus signs in the chart to necessarily indicate something the modules should support. Then again if Tilly does add the user-defined features, I can use Text::xSV as a backend for DBD::AnyData so that there would be a DBI/SQL interface to his excellent module.
The problem is not adding the features, it is figuring out the API for offering them, and seeing the need.
Which of the above user-defined features does DBD::AnyData need? All of them? Some? We probably should offline a discussion of how to provide them in a sane manner.
Some are already present in some form. For instance Text::xSV offers the ability to pre-filter input as it comes in. There is nothing to stop such a filter from saying, "These characters are not allowed." That won't work for eliminating embedded tabs in tab-delimited data. But it would for NUL characters. And if you want to disallow tabs embedded in a tab-delimited file, you could just wrap Text::xSV with the necessary validation logic.
Also, as dragonchild points out, it reads based on $/, which can effectively allow you to change the record separator. Furthermore, because I got tired of bug reports about invalid CSV files, by default Text::xSV will treat either \n or \r\n as a newline, so files produced on Windows can be read on Unix and vice versa.
Some of the other missing features can't presently be worked around. For me the barrier isn't adding them if I thought people would use them, it is figuring out what a reasonable API should be.
Re: Comparison of the parsing features of CSV (and xSV) modules
by hossman (Prior) on Jun 14, 2004 at 17:00 UTC
Great chart. Three suggestions...
- break out "accept newlines" into two separate qualifications, "accept newlines with other record separator" and "accept newlines with newline separator" ... that will eliminate the need for the "*" and the special footnote.
- It might be useful for people who aren't intimately familiar with *SV to see a simple sample set for each of the cases described in that table.
- Given a sample set for each case, a small sample program for each module showing how to parse that data would provide both a good example of how to use the module and benchmark criteria. You could then conceivably replace those "+" signs with timing info.
(All of which is just to say: I thought about doing this myself and posting as a followup, but then I remembered I'm at work, and I'm planning on leaving early today, so I really can't justify spending that kind of time ... but I thought I'd throw it out in case anyone else thought it sounded like a good idea and wanted to take a stab at it.)
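A lowest-common-denominator timing run along those lines could be as simple as this sketch (it assumes both modules are installed and uses only the parse/fields calls they share):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use Text::CSV;
use Text::CSV_XS;

my $line = q{"Just","Another","CSV","Hacker"};
my $pp = Text::CSV->new;
my $xs = Text::CSV_XS->new;

# Compare parses-per-second of the pure-perl and XS modules
cmpthese(-2, {
    'Text::CSV'    => sub { $pp->parse($line) or die; my @f = $pp->fields },
    'Text::CSV_XS' => sub { $xs->parse($line) or die; my @f = $xs->fields },
});
```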
Re: Comparison of the parsing features of CSV (and xSV) modules
by perrin (Chancellor) on Jun 14, 2004 at 17:37 UTC
Nice comparison. A small lowest-common-denominator benchmark would be cool too.
Re: Comparison of the parsing features of CSV (and xSV) modules
by dbush (Deacon) on Jun 15, 2004 at 06:45 UTC
Great node jZed. Definitely a ++ required.
One additional thing that would be useful, for me at least, is how each handles the third field for the following inputs:
"Just","Another",,"Hacker"
"Just","Another","","Hacker"
My view of the standard is that the first example should give undef and the second should be an empty string.
Re: Comparison of the parsing features of CSV (and xSV) modules
by bsb (Priest) on Jun 15, 2004 at 01:57 UTC
I'd like to see an interface-style row on your comparison. Text::CSV_XS has such a crummy interface for both input and output: all that error-flag checking and unnecessary OO. Are the others any better?
I want something as simple as split & join, that just dies on bad input and is configurable via package vars or a parser object. It seems like the Text::CSV_XS interface was not driven by "make simple things simple". It probably only needs a wrapper module.
(Seems like AnyData and the DBDs provide useful interfaces for their respective domains though.)
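For what it's worth, a split-&-join-style wrapper over Text::CSV_XS that just dies on bad input only takes a few lines (a quick sketch; CSV::Simple, csv_split and csv_join are made-up names):

```perl
package CSV::Simple;
use strict;
use warnings;
use Text::CSV_XS;

# One package-level parser object; replace it to change options.
our $CSV = Text::CSV_XS->new({ binary => 1 });

# Split a CSV line into a list of fields, dying on bad input.
sub csv_split {
    my ($line) = @_;
    $CSV->parse($line) or die "bad CSV input: $line";
    return $CSV->fields;
}

# Join a list of fields into a CSV line, dying on failure.
sub csv_join {
    my (@fields) = @_;
    $CSV->combine(@fields) or die "cannot combine fields";
    return $CSV->string;
}

1;
```

Usage would then be as simple as my @fields = csv_split($line) and my $line = csv_join(@fields).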
Re: Comparison of the parsing features of CSV (and xSV) modules
by Wally Hartshorn (Hermit) on Jun 15, 2004 at 14:12 UTC
Great post! I obviously have an interest in this given my recent self-inflicted problems with processing a CSV file. :-)
About the only thing missing from most (all?) of these modules is a way to handle embedded, unescaped delimiter characters. (Wouldn't clean input data be nice? *sigh*) Perhaps setting the delimiter character to the empty string could trigger a separate set of logic to handle that case: a delimiter character followed by a separator character or a newline would be treated as a real closing delimiter, while others would be ignored, perhaps.
Not that I'm volunteering, of course. :-)
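That closing-delimiter rule can at least be prototyped with a regex for the case where every field is quoted (a hypothetical sketch; parse_messy is a made-up name, not part of any module above):

```perl
use strict;
use warnings;

# Treat a quote as closing only when it is followed by a separator
# or end-of-line; any other embedded quote is taken as literal data.
sub parse_messy {
    my ($line) = @_;
    my @fields;
    while ($line =~ /"((?:[^"]|"(?![,\n]|$))*)"(?=,|$)/g) {
        push @fields, $1;
    }
    return @fields;
}

my @f = parse_messy(q{"He said "hi" to me","clean field"});
print join("|", @f), "\n";   # He said "hi" to me|clean field
```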
What would be some example data, how it's currently being parsed, and how you'd like it to be parsed?
Re: Comparison of the parsing features of CSV (and xSV) modules
by Tux (Canon) on Feb 19, 2008 at 12:54 UTC
Re: Comparison of the parsing features of CSV (and xSV) modules
by Tux (Canon) on May 08, 2007 at 13:37 UTC
Just for the record, Text::CSV_XS does know the difference between NULL/undef and empty strings. This will show when using always_quote => 1:
my $csv = Text::CSV_XS->new ({ always_quote => 1});
$csv->combine ("", 1, undef, "bar");
print $csv->string, "\r\n";
Will result in:
"","1",,"bar"
Note the missing quotes for the undef entry.
If requested, I can consider adding an option to do the reverse also: when always_quoted is true, return empty columns as undef in the $csv->fields call.
Enjoy, Have FUN! H.Merijn
Re: Comparison of the parsing features of CSV (and xSV) modules
by Tux (Canon) on Mar 01, 2008 at 16:57 UTC
With the release of Text-CSV_XS 0.35 (Valloire), I think I have addressed the undef question enough to be really useful:
> perl -MData::Dumper -MText::CSV_XS -l \
-e 'my $csv=Text::CSV_XS->new ({ blank_is_undef => 1 });' \
-e '$csv->parse (q{1,,"", });' \
-e 'print Dumper$csv->fields'
$VAR1 = '1';
$VAR2 = undef;
$VAR3 = '';
$VAR4 = ' ';
Enjoy, Have FUN! H.Merijn