Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

A better way to split CSV files with quoted strings that may contain commas?

by rkaminski (Initiate)
on May 31, 2006 at 16:37 UTC ( [id://552860]=perlquestion: print w/replies, xml ) Need Help??

rkaminski has asked for the wisdom of the Perl Monks concerning the following question:

I'm using a split command to parse a comma separated value (CSV) file (there are no standard field sizes, they vary a lot) that is produced from a vendor product. It has worked well for years. The format is something like: ($semi_thirty_char_node_key,undef,undef,undef,$mem_size,$spec,$num_proc,undef,undef,undef,$mode,undef)= split /,/, $inline; Recently the vendor started using single quoted strings in the data like 'IBM machine' or worse 'IBM,16576523'. Note the comma in the second one that throws off my split. I'm not in love with split, I'll accept any valid perlish way that I can understand. What I need it is a way to parse a CSV file that knows to ignore commas within single quoted strings. Any thoughts or areas to study that may yield a solution will be deeply appreciated.
  • Comment on A better way to split CSV files with quoted strings that may contain commas?
  • Download Code

Replies are listed 'Best First'.
Re: A better way to split CSV files with quoted strings that may contain commas?
by Zaxo (Archbishop) on May 31, 2006 at 16:46 UTC

    This link, CSV, will search CPAN and find a pile of good modules for working with csv files. DBD::CSV has the advantage that DBI works with it. That will ease the transition to other kinds of databases.

    After Compline,
    Zaxo

Re: A better way to split CSV files with quoted strings that may contain commas?
by Joost (Canon) on May 31, 2006 at 16:42 UTC
Re: A better way to split CSV files with quoted strings that may contain commas?
by gellyfish (Monsignor) on May 31, 2006 at 18:42 UTC

    As others have already pointed out the "correct" way to deal with CSV files is to use one of the available modules, however this is a FAQ and a way of doing this by hand is discussed there.

    /J\

      Wow! What a timely tidbit of information. I was just pondering the same problem and your link was most helpful. But as I discovered from the FAQ link, the example doesn't handle extra spacing (around commas) very well.

      Here's my 'space fixed' version, with checking for single quotes as well:
      use strict; use warnings; # crazy mix of quoting my $test1 = q/, "test with space ",,, mary had , a,, 'cake, with +cheese' , and, "a, \"little" , lamb chop , /; # original FAQ example my $test2 = q/SAR001,"","Cimetrix, Inc","Bob Smith","CAM",N,8,1,0,7,"E +rror, Core Dumped"/; # throw some wild spaces in there my $test3 = q/SAR001, "", "Cimetrix, Inc", "Bob Smith", "CAM",N, 8 ,1, +0 ,7, "Error, Core Dumped"/; # finally a nearly empty string my $test4 = q/,/; split_string($test1); split_string($test2); split_string($test3); split_string($test4); sub split_string { my $text = shift; my @new = (); push(@new, $+) while $text =~ m{ \s*( # groups the phrase inside double quotes "([^\"\\]*(?:\\.[^\"\\]*)*)"\s*,? # groups the phrase inside single quotes | '([^\'\\]*(?:\\.[^\'\\]*)*)'\s*,? # trims leading/trailing space from phrase | ([^,\s]+(?:\s+[^,\s]+)*)\s*,? # just to grab empty phrases | (), )\s*}gx; push(@new, undef) if $text =~ m/,\s*$/; # just to prove it's working print "string: >>$text<<\n"; foreach (@new) { print " part: >>" . (defined($_) ? $_ : '') . "<<\n"; } }
Re: A better way to split CSV files with quoted strings that may contain commas?
by derby (Abbot) on May 31, 2006 at 19:23 UTC

    Recently the vendor started using single quoted strings in the data like 'IBM machine' or worse 'IBM,16576523'

    Although not an official standard, you may want to point your vendor to RFC 4180 or the How To or even Wikipedia. They should really be using double quotes, not single. I know most of the CSV modules support double quotes ... I wouldn't bet on them supporting single quotes.

    -derby
Re: A better way to split CSV files with quoted strings that may contain commas?
by rkaminski (Initiate) on Jun 01, 2006 at 16:38 UTC
    UPDATE from rkaminski: I really appreciate all the replies, Wow! The vendor does not use full CSV format, or CSV header records, and hasn't for 20 years, and doesn't plan to, so all of the module techniques mentioned, while certainly the correct way to go in true CSV situations, do not seem to apply to this admittedly strange special case. I did learn a lot about proper CSV techniques, and for that I thank you all. However, the magical regular expression idea from the FAQ, (modified slightly to use ' instead of ") worked great. Here is the updated code, which works well enough, although I suspect that there is a more perlish way to do it. I was also really impressed by the contributor (ruzam) who modified it even farther to deal with ', " and extra spaces, which surpassed my needs but was excellent for folks who have even odder input than I do!
    @inline=(); push(@inline, $+) while $pre_inline =~ m{'([^\'\\]*(?:\\.[^\'\\]*) +*)',? | ([^,]+),?| , }gx; push(@inline, undef) if substr($pre_inline,-1,1) eq ','; $semi_thirty_char_node_key = $inline[0]; $mem_size = $inline[4]; $spec = $inline[5]; $num_proc = $inline[6]; $model = $inline[10]; $os = $inline[11]; $alias = $inline[14]; $perf_rate_type = $inline[15]; $node_mode = $inline[18]; $num_logical_processors = $inline[19];
    The input data was:
    ;pdbdevd1 Olx8jQCKrnQvrA0,'IBM eServer p5 595 1900MH',pdbdevd1,I +BM_P5_595_1900,4096.000000,394.848511,27,,,,IBM_p5_595_1900,AIX,,,pdb +devd1,SPECINTRATE2000,EST,'IBM,02028C8AD',LOGICAL,2,DLPAR,SMT
    So, instead of using the split with undefs I resorted to indexing off of $inline, but it worked! granted, timtowtdi, but if you have a more elegant way, please comment. Thanks to all who helped so far! I will try to help others in the future too!
Re: A better way to split CSV files with quoted strings that may contain commas?
by pajout (Curate) on Jun 01, 2006 at 10:20 UTC
    Just for enjoy, the better way is to use previously adviced modules.

    When you split the row by comma, you must check the count of quotes in every array item, from 0th to Nth. When this count is odd, you must join that item with the next item and test the count again. If the item with odd count of quote is the last, you must add split items from the next row (it occurs when \n in strings).

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://552860]
Approved by Joost
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others learning in the Monastery: (6)
As of 2024-04-25 08:10 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found