There are at least six CPAN modules that handle "text delimited formats", more properly called "text separated formats", such as CSV (Comma Separated Values). The chart below attempts to compare their parsing properties - how they handle and define the data format options. The chart does *not* show other differences between the modules (e.g. that Text::xSV has specialized formatting and printing routines, that Text::CSV_XS has routines related to data typing, that the DBDs support DBI/SQL access to the data formats, etc.).
modules
This comparison covers
Text::CSV, Text::xSV, Text::CSV_XS, DBD::CSV, DBD::AnyData, AnyData
disclaimer
I am the maintainer of the last three modules. If I've inadvertently misrepresented any of the modules, it's out of ignorance; please correct me. My congrats to Tilly, Alan Citterman, and Jochen Wiedmann, authors of the other excellent modules on the list.
- First some definitions:
- CSV
- field separator
- delimiter char
- escape char
- record separator
- ability to accept embedded newlines
- ability to reject embedded newlines
- ability to accept embedded binary data
- ability to reject embedded binary data
- ability to allow sparse delimiting
- support for forced delimited writes
- null handling
- pure perl
- The Comparison Chart
First some definitions:
CSV
Comma Separated Values is not a single standard; it refers to a number of slightly different ways to represent data. There is no "correct CSV", only CSV that is correct according to the rules of a particular CSV style. "Classic" CSV, the kind many people think of when they talk about CSV, is a set of records separated by newlines, with the fields of each record separated by commas, the contents of the fields (in some cases) delimited with double quote marks, and a doubled double-quote as the escape within fields. But there is, AFAIK, no ISO or ANSI or other international standard defining this "classic" CSV as the one true CSV. Nearly all of the CPAN modules which handle CSV formats (Text::CSV is the exception in the chart below) allow redefinition of the separator character, so the format is really *SV, as it includes "tab delimited" and "pipe delimited" formats which simply use tabs or pipes in the place where CSV uses commas.
These words form a comma-SEPARATED, period-TERMINATED record with four quote-DELIMITED fields.
"Just","Another","CSV","Hacker".
field separator
what goes between fields, a comma in classic CSV but e.g. a tab or pipe in "tab delimited" or "pipe delimited" formats
delimiter char
what goes around fields, a pair of double quotes in classic CSV, but some modules allow it to be redefined
escape char
the character used to escape the delimiter when it occurs embedded in a field, a double quote in classic CSV (e.g. "this, ""is"" one field"), but some modules allow it to be redefined (e.g. to a backslash)
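The three definitions above can be illustrated with a short sketch. It uses Python's standard `csv` module, which is not one of the CPAN modules compared here and is chosen only because it exposes the same three knobs under the names `delimiter` (field separator), `quotechar` (delimiter char) and `escapechar`. The pipe-separated, backslash-escaped variant is an invented example, not a format taken from any of the modules.

```python
import csv

# Classic style: comma separator, double-quote delimiter,
# doubled double-quote as the escape inside a field.
classic = next(csv.reader(['"this, ""is"" one field",second']))

# A "pipe delimited" variant that escapes the delimiter with a backslash
# instead of doubling it (illustrative format, assumed for this sketch).
piped = next(csv.reader(
    ['"this, \\"is\\" one field"|second'],
    delimiter="|", quotechar='"', escapechar="\\", doublequote=False,
))

print(classic)
print(piped)
```

Both calls should yield the same two fields, which is the point: the separator, delimiter and escape are conventions of a particular *SV style, not properties of the data.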
record separator
what goes between records, a newline in classic CSV, but some modules allow it to be redefined; this can be critical if you are mixing CSV files created on different operating systems without using something like dos2unix to convert them, since the newline differs between OSs; alternate record separators also allow data in "vertical" formats, e.g. where a newline is the field separator and a double newline is the record separator
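As a rough illustration of the "vertical" layout just described, here is a minimal sketch in Python. The function name `parse_vertical` and the sample data are hypothetical, and the sketch assumes fields contain no delimiters or embedded line breaks, so it is not a substitute for a real parser. It also normalizes DOS and old Mac line endings first, which is the job dos2unix would otherwise do.

```python
def parse_vertical(text, field_sep="\n", record_sep="\n\n"):
    """Split a 'vertical' file: one field per line, blank line between records."""
    # Normalize CRLF and bare CR so files from different OSs parse alike.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop leading/trailing newlines so the last record has no empty field.
    text = text.strip("\n")
    return [rec.split(field_sep) for rec in text.split(record_sep) if rec]


unix_file = "Alice\n30\n\nBob\n25\n"
dos_file = "Alice\r\n30\r\n\r\nBob\r\n25\r\n"

print(parse_vertical(unix_file))
print(parse_vertical(dos_file))
```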
ability to accept embedded newlines
the ability to use the newline character inside a field, obviously critical if your data has newlines
ability to reject embedded newlines
sometimes this is the desired behaviour, e.g. if you are prepping data for another program which won't accept embedded newlines
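The accept/reject distinction can be sketched as follows, again with Python's standard `csv` module standing in for the CPAN modules. The reader accepts a newline inside a quoted field; `reject_embedded_newlines` is a hypothetical strict check layered on top, showing the kind of failure you might want when the data is headed for a program that cannot cope with multi-line fields.

```python
import csv
import io

data = 'id,comment\n1,"line one\nline two"\n'

# Accepting: a newline inside a quoted field is treated as data,
# so the input parses as two records, not three.
rows = list(csv.reader(io.StringIO(data, newline="")))


def reject_embedded_newlines(parsed_rows):
    """Hypothetical strict mode: fail if any field contains a line break."""
    for recno, row in enumerate(parsed_rows, start=1):
        for field in row:
            if "\n" in field or "\r" in field:
                raise ValueError(f"embedded newline in record {recno}")
    return parsed_rows


print(rows)
try:
    reject_embedded_newlines(rows)
except ValueError as err:
    print("rejected:", err)
```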
accept embedded binary data
the ability to use binary data (e.g. NUL characters or ^L) embedded in fields
reject embedded binary data
again, sometimes this is the desired behaviour - if you are prepping for a program that won't accept binary data, you want the parser to fail on parsing
ability to allow sparse delimiting
classic CSV uses sparse delimiting - it uses delimiters only around fields that need them, e.g. those fields that have embedded commas, newlines, or quotes; with sparse delimiting this is a valid 3-field record: foo,"bar,bop",7
support for forced delimited writes
but some CSV styles always use delimiters for all fields, so some modules support forcing delimiters onto all fields or onto all non-numeric fields
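The difference between sparse and forced delimiting shows up on the writing side. The sketch below uses Python's standard `csv` writer as a stand-in (it is not one of the modules in the chart); its `QUOTE_MINIMAL`, `QUOTE_ALL` and `QUOTE_NONNUMERIC` settings correspond roughly to sparse delimiting, forced delimiting of all fields, and forced delimiting of all non-numeric fields. The helper name `write_row` is made up for this example.

```python
import csv
import io


def write_row(row, quoting):
    """Serialize one record with the given quoting policy."""
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting, lineterminator="\n").writerow(row)
    return buf.getvalue()


row = ["foo", "bar,bop", 7]

sparse = write_row(row, csv.QUOTE_MINIMAL)          # only fields that need it
forced = write_row(row, csv.QUOTE_ALL)              # every field
non_numeric = write_row(row, csv.QUOTE_NONNUMERIC)  # every non-numeric field

print(sparse, forced, non_numeric, sep="")
```

All three outputs parse back to the same three fields; the choice only matters when the consumer insists on one style.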
null differentiated from empty
Text::xSV differentiates between null (undefined values) and an empty string. The other modules treat them the same.
pure perl
some of the modules are pure-perl and can therefore be installed without compilation; others have C/XS components and must be compiled for a specific platform; the C/XS modules are generally faster than the pure-perl modules
The Comparison Chart
A plus mark indicates the presence of a feabugir (feature or bug or irrelevant, depending on the context), not necessarily that it is "better" than a minus mark.
property | Text::CSV | Text::xSV | Text::CSV_XS | DBD::CSV | AnyData |
---|---|---|---|---|---|
accept newlines | - | + | + | + | * |
reject newlines | + | - | + | - | + |
accept embedded binary | - | + | + | + | + |
reject embedded binary | + | - | + | - | - |
forced delimiting | + | - | + | - | - |
sparse delimiting | - | + | + | + | + |
user-defined field sep | - | + | + | + | + |
user-defined delimiter | - | - | + | + | + |
user-defined escape | - | - | + | + | + |
user-defined record sep | - | - | + | + | + |
pure perl | + | + | - | - | + |
null handling | - | + | - | - | - |
Notes
Some of the modules accept flags which can change their default behaviour, e.g. Text::CSV_XS defaults to rejecting newlines but can easily be set to accept them by passing the "binary" flag. In these cases, they are shown with plus marks for all possible settings.
DBD::AnyData has the same properties as AnyData (which is a multi-level tied-hash interface to the data). Both accept embedded newlines only if something other than a newline is used as the record separator; this is what the asterisk in the chart indicates.
DBD::CSV is actually built on top of Text::CSV_XS but since it uses specific flags for Text::CSV_XS, its parsing properties are somewhat different.
Update: added readmore tags. Update 2: added null handling.