EyeOpener has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

I need to read in a CSV file, do some munging, and write out the alterations to a new CSV file. I'd like to use Text::CSV for this, but the problem is this particular CSV file uses double-quotes only around fields it has defined as "text fields" (although the contents of a particular text field may actually be a number). To make matters worse, the fields can change from one line to the next. (Actually, I have file layouts that would help me determine which lines are going to have which field definitions, but if possible, I'd like to find a "generic" solution that will work with other files from this same application.)

So, I may need to read in lines like:

"hi",3,20.6,"green","32" 16,"alpha",0.00
and write them out as
"bye",3,27.6,"green","32" 16,"beta",0.00
As far as I can tell, Text::CSV can handle all of this except writing out with quotes the way I need. It either writes everything with quotes or nothing. For my purposes, if the field had quotes coming in, it needs quotes going out.

I don't want to write my own CSV parser for this. Can I still use Text::CSV somehow, or is there another perlish solution?

-E

Replies are listed 'Best First'.
Re: Writing CSV files
by tachyon (Chancellor) on Oct 28, 2004 at 23:24 UTC

    Seems like a somewhat strange requirement. Presumably the application is dumb as a box of hammers and is quoting any internal *char type. Essentially you need to remember if a given field was quoted or not which is typically information that CSV parsers discard. You really need a specialised split that returns array refs for the fields where the first element is the field data and the second the quote status. Here is the basic idea:

    #!/usr/bin/perl -w use strict; while( my $data = <DATA> ) { chomp($data); next unless $data; my $fields = get_fields( $data ); $fields->[rand(3)]->[0] = 'newval'; my $str = remap_fields( $fields ); print "$str\n"; } sub get_fields { [ map{ s/^"|"$//g ? [ $_, 1 ] : [ $_, 0 ] } split /(?<=\S),(?=\S)/ +, $_[0] ]; } sub remap_fields { join ',', map{ $_->[1] ? qq!"$_->[0]"! : $_->[0] } @{$_[0]}; } __DATA__ "hi",3,20.6,"green","32" 16,"alpha",0.00

    The get_fields() function may or may not work for you. The logic it uses not to split on embedded commas is to only split on \S,\S, This is making the invalid assumption that embeded commas probably look, like, this ie have spaces. You may have no embedded commas or the assumption may be true. If it is not you will need a better parser.

    cheers

    tachyon

Re: Writing CSV files
by Joost (Canon) on Oct 28, 2004 at 23:14 UTC

      It is not really that weird. Some of the application that is type-aware, do generate CSV file like this, especially database applications, adn Text::CSV is ready to handle this, both in and out.

Re: Writing CSV files
by ikegami (Patriarch) on Oct 28, 2004 at 23:32 UTC

    Like this?

    sub dequote { local $_ = @_ ? $_[0] : $_; return $_ unless (s/^"(.*)"$/$1/s); s/\\(.)/$1/gs; return $_; } my @data; while (<DATA>) { chomp; my @fields; $_ = $_ . ','; push(@fields, $1) while (/\G("(?:[^"\\]|\\.)*"|[^,]*)(?:,)/g); @fields = map { dequote } @fields; push(@data, \@fields); } require Data::Dumper; print(Data::Dumper::Dumper(\@data)); __DATA__ "hi",3,20.6,"green","32" 16,"alpha",0.00
Re: Writing CSV files
by dragonchild (Archbishop) on Oct 29, 2004 at 00:38 UTC
    I'm pretty sure, having read the code, that you can subclass Text::xSV to do what you need. Or, rather, since tilly didn't write it to be subclassed, you might have to do some sort of redefinition of some of the functions, but it shouldn't be that painful. If you want some help, give us a working example of what you need and we can help figure it out.

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

Re: Writing CSV files
by pg (Canon) on Oct 28, 2004 at 23:27 UTC

    With Text::CSV, if you want to embed double-quote, represent it by a pair of consecutive double-quotes.

Re: Writing CSV files
by steves (Curate) on Oct 29, 2004 at 09:50 UTC

    The default behavior for Text::CSV_XS works as you want things to. Example:

    use strict; use Text::CSV_XS; use IO::File; my $csv = Text::CSV_XS->new(); my @row; while (<DATA>) { chomp; @row = split(/\|/, $_); $csv->print(\*STDOUT, \@row); print "\n"; } __DATA__ Field1|Field Number Two|Field Number Three|Field4 Another One|foo|fubar|foo and fubar
    produces:
    Field1,"Field Number Two","Field Number Three",Field4 "Another One",foo,fubar,"foo and fubar"

    I've found Text::CSV_XS to be the most accurate at generating true CSV, versus just delimited by commas, which some people seem to think is CSV.

      The default behavior for Text::CSV_XS works as you want things to.

      Actually what you have shown and what Text::CSV_XS does has no relationship whatsoever to what the OP wanted.

        You are correct tachyon. It took me three cups of coffee to see that. My apologies. The issue here is rather custom since none of the CSV packages I've used will tell you if something was quoted coming in -- they hide that. It seems like to do this right you need to know what was quoted coming in and mirror that on the way out. If the rules are based on content (e.g., something that can be regexp matched) then maybe you don't need to know how it was originally.

Re: Writing CSV files
by jZed (Prior) on Oct 29, 2004 at 15:32 UTC
    As far as I can tell, Text::CSV can handle all of this except writing out with quotes the way I need. It either writes everything with quotes or nothing.
    Text::CSV_XS takes a different approach. By default it quotes only fields that need to be quoted for parsability i.e. it quotes fields that have embedded separator or delimiter characters and does not quote fields that do not contain them i.e. foo,7,bar,"x,y",qux. You can also specify always quote to get quote characters delimiting all fields.

    You might want to examine your data to make sure that isn't what is happening.

    If your fields are not quoted according to their parseability, then it is most likely that they are quoted by field type (as understood by the original app) so that they will be consistent *by field*. In other words if fields 1, 4, and 5 are thought by the original app to contain strings, then fields 1, 4, and 5 will be quoted. Here's a script which will honor the original apps' field definitions - it reads CSV fields (including embedded commas) and writes out the fields quoted as in the original. It uses AnyData, another CSV-capable module. The example uses a DATA section, but the module also supports reading from files.

    #!/usr/bin/perl -w use strict; use AnyData; my %is_quoted_field = ( c1=>1, c4=>1, c5=>1 ); my @cols = qw ( c1 c2 c3 c4 c5 c6 ); my $table = adTie('CSV',[<DATA>],'r',{cols=>[@cols]}); while (my $row = each %$table) { my @fields; for (@cols) { my $field = ($is_quoted_field{$_}) ? q{"} . $row->{$_} . q{"} : $row->{$_}; push @fields,$field; } print join(',',@fields)."\n"; } __DATA__ "bye",3,27.6,"green","32",stuff "x,y",5,28.2,"yellow","33",more stuff
Re: Writing CSV files
by TilRMan (Friar) on Oct 29, 2004 at 15:07 UTC
    Here's a quick subclass of Text::CSV_XS to return you an opaque $quote object from getline() that you pass to print(). Two gotchas though: First, commas embedded in a field screw it up. Second, I only implemented new, getline(), and print(); not combine() or parse() or anything else.

    If the strings can have commas, I don't know if Text::CSV_XS can help you.

Re: Writing CSV files
by EyeOpener (Scribe) on Oct 29, 2004 at 17:22 UTC
    Many thanks to tachyon, TilRMan, jZed and others for the helpful responses! Yes, this is an ancient ('70's era) app that has its own bizarre (but consistent) logic for maintaining these files. Unfortunately, it does allow for both commas and quote characters within quoted text fields, which is why I didn't want to get into the business of writing my own parser.

    As I mentioned, I do have file layouts detailing the datatypes of each field I'm going to encounter. I think the approach I'll take will be to read the file layouts into a data structure, and re-quote the text fields when they're in an array. That way I can still use the well-tested Text::CSV or Text::CSV_XS for parsing and rewriting the CSVs. It'll take more work, but that's probably the most bulletproof and flexible approach.

    Again, thanks!

    -E