bulk88 has asked for the wisdom of the Perl Monks concerning the following question:

I have some interestingly formatted CSV I need to get into a Perl hash. One of the columns uses "s as the quote char, all other columns use 's. The " column has 's in it. See the "JOHNS FLYING DOG" row. The last time this data format was used, I wrote text editor regexes to turn the CSV data into a literal perl hash of array refs, then processes that data structure into a hash of hashes. I'd like a better solution than the previous one (I also lost the previous solution). How do I get this CSV into a Perl hash? I am not sure Text::CSV has the options to do it, and I couldn't easily find any other CSV parser engines on CPAN (everything seems to be wrappers around Text::CSV or CSV_XS).
use Text::CSV::Hashify; use Data::Dumper; my $filename = 'bad.csv'; $obj = Text::CSV::Hashify->new( { file => $filename, format => 'hoh', key => "PRODUCT CODE", quote_char => "'", escape_char => "|", #wont be found in the data, turn off +escaping auto_diag => 1, } ); print Dumper($obj->all);
The CSV file
'PRODUCT CODE','CATEGORY','CATEGORY DESCRIPTION','CODE DESCRIPTION','O +PTIONAL CATEGORY','OPTIONAL CATEGORY DESCRIPTION' ' ','0 ','No Item',"INVALID CODE IN USER SUPPLIED DATA",' ',' ' '00100','1 ','Cat',"ORANGE CAT",' ',' ' '82131','94 ','Dog',"GREEN DOG",' ',' ' '82132','94 ','Dog',"'JOHNS' FLYING' DOG (Start 2001)",' ',' ' '82133','94 ','Dog',"MAGENTA DOG (End 2009)",' ',' '
The error message
# CSV_XS ERROR: 2034 - EIF - Loose unescaped quote @ rec 4 pos 24 $VAR1 = { ' ' => { 'CODE DESCRIPTION' => '"INVALID CODE IN USER SU +PPLIED DATA"', 'OPTIONAL CATEGORY DESCRIPTION' => ' ', 'CATEGORY' => '0 ', 'PRODUCT CODE' => ' ', 'CATEGORY DESCRIPTION' => 'No Item', 'OPTIONAL CATEGORY' => ' ' }, '82131' => { 'CODE DESCRIPTION' => '"GREEN DOG"', 'CATEGORY' => '94 ', 'OPTIONAL CATEGORY DESCRIPTION' => ' ', 'PRODUCT CODE' => '82131', 'OPTIONAL CATEGORY' => ' ', 'CATEGORY DESCRIPTION' => 'Dog' }, '00100' => { 'OPTIONAL CATEGORY' => ' ', 'CATEGORY DESCRIPTION' => 'Cat', 'CODE DESCRIPTION' => '"ORANGE CAT"', 'CATEGORY' => '1 ', 'OPTIONAL CATEGORY DESCRIPTION' => ' ', 'PRODUCT CODE' => '00100' } };

Replies are listed 'Best First'.
Re: parsing malformed CSV with per column quote chars
by Tux (Canon) on Aug 07, 2014 at 18:04 UTC

    Update. Can reproduce

    $ cat test.csv 'PRODUCT CODE','CATEGORY','CATEGORY DESCRIPTION','CODE DESCRIPTION','O +PTIONAL CATEGORY','OPTIONAL CATEGORY DESCRIPTION' ' ','0 ','No Item',"INVALID CODE IN USER SUPPLIED DATA",' ',' ' '00100','1 ','Cat',"ORANGE CAT",' ',' ' '82131','94 ','Dog',"GREEN DOG",' ',' ' '82132','94 ','Dog',"'JOHNS' FLYING' DOG (Start 2001)",' ',' ' '82133','94 ','Dog',"MAGENTA DOG (End 2009)",' ',' ' $ perldoc -m CSV | head -7 use Text::CSV_XS qw( csv ); use Data::Peek; BEGIN { *CSV:: = \%Text::CSV_XS::; } $VERSION = "0.02"; 1; $ perl -MCSV -e'DDumper csv(in=>"test.csv",diag_verbose=>9,quote_char= +>"\x27",escape_char=>undef)' # CSV_XS ERROR: 2034 - EIF - Loose unescaped quote @ rec 4 pos 24 '82132','94 ','Dog',"'JOHNS' FLYING' DOG (Start 2001)",' ',' ' ' ^ [ [ 'PRODUCT CODE', 'CATEGORY', 'CATEGORY DESCRIPTION', 'CODE DESCRIPTION', 'OPTIONAL CATEGORY', 'OPTIONAL CATEGORY DESCRIPTION' ], [ ' ', '0 ', 'No Item', '"INVALID CODE IN USER SUPPLIED DATA"', ' ', ' ' ], [ '00100', '1 ', 'Cat', '"ORANGE CAT"', ' ', ' ' ], [ 82131, '94 ', 'Dog', '"GREEN DOG"', ' ', ' ' ] ]

    You need allow_loose_quotes:

    perl -MCSV -e'DDumper csv(in=>"test.csv",diag_verbose=>9,quote_char=>" +\x27",escape_char=>undef,allow_loose_quotes=>1)' [ [ 'PRODUCT CODE', 'CATEGORY', 'CATEGORY DESCRIPTION', 'CODE DESCRIPTION', 'OPTIONAL CATEGORY', 'OPTIONAL CATEGORY DESCRIPTION' ], [ ' ', '0 ', 'No Item', '"INVALID CODE IN USER SUPPLIED DATA"', ' ', ' ' ], [ '00100', '1 ', 'Cat', '"ORANGE CAT"', ' ', ' ' ], [ 82131, '94 ', 'Dog', '"GREEN DOG"', ' ', ' ' ], [ 82132, '94 ', 'Dog', '"\'JOHNS\' FLYING\' DOG (Start 2001)"', ' ', ' ' ], [ 82133, '94 ', 'Dog', '"MAGENTA DOG (End 2009)"', ' ', ' ' ] ]

    To get (with Text::CSV_XS) hashes, you need headers and a map:

    $ perl -MCSV -e'DDumper{map{$_->{"PRODUCT CODE"}=>$_}@{csv(in=>"test.c +sv",diag_verbose=>9,quote_char=>"\x27",escape_char=>undef,allow_loose +_quotes=>1,headers=>"auto")}}' { ' ' => { CATEGORY => '0 ', 'CATEGORY DESCRIPTION' => 'No Item', 'CODE DESCRIPTION' => '"INVALID CODE IN USER SUPPLIED DATA"', 'OPTIONAL CATEGORY' => ' ', 'OPTIONAL CATEGORY DESCRIPTION' => ' ', 'PRODUCT CODE' => ' ' }, '00100' => { CATEGORY => '1 ', 'CATEGORY DESCRIPTION' => 'Cat', 'CODE DESCRIPTION' => '"ORANGE CAT"', 'OPTIONAL CATEGORY' => ' ', 'OPTIONAL CATEGORY DESCRIPTION' => ' ', 'PRODUCT CODE' => '00100' }, 82131 => { CATEGORY => '94 ', 'CATEGORY DESCRIPTION' => 'Dog', 'CODE DESCRIPTION' => '"GREEN DOG"', 'OPTIONAL CATEGORY' => ' ', 'OPTIONAL CATEGORY DESCRIPTION' => ' ', 'PRODUCT CODE' => 82131 }, 82132 => { CATEGORY => '94 ', 'CATEGORY DESCRIPTION' => 'Dog', 'CODE DESCRIPTION' => '"\'JOHNS\' FLYING\' DOG (Start 2001)"', 'OPTIONAL CATEGORY' => ' ', 'OPTIONAL CATEGORY DESCRIPTION' => ' ', 'PRODUCT CODE' => 82132 }, 82133 => { CATEGORY => '94 ', 'CATEGORY DESCRIPTION' => 'Dog', 'CODE DESCRIPTION' => '"MAGENTA DOG (End 2009)"', 'OPTIONAL CATEGORY' => ' ', 'OPTIONAL CATEGORY DESCRIPTION' => ' ', 'PRODUCT CODE' => 82133 } }

    Enjoy, Have FUN! H.Merijn

      You'll need to add a post-processing step to remove the double quotes. And you'll have to hope that none of the in-double-quotes fields contain any comma characters.

      - tye        

        Yes, the usual drawbacks. I was merely answering on the required options to make that work.


        Enjoy, Have FUN! H.Merijn
Re: parsing malformed CSV with per column quote chars (SMoP)
by tye (Sage) on Aug 07, 2014 at 18:28 UTC

    It isn't CSV. So just write a parser for it. This isn't rocket surgery.

    #!/usr/bin/perl -w use strict; my @data; while( <DATA> ) { my @row; while( /\G(?=.)/gc ) { my $val = undef; /\G\s*/gc; if( /\G'/gc ) { my $p = pos(); /\G(?:[^']+|'')*/gc; $val = substr( $_, $p, pos()-$p ); die "Unclosed '\n" if ! /\G'/gc; } elsif( /\G"/gc ) { my $p = pos(); /\G(?:[^"]+|"")*/gc; $val = substr( $_, $p, pos()-$p ); die "Unclosed \"\n" if ! /\G"/gc; } else { my $p = pos(); $val = substr( $_, $p, pos()-$p ) if /\G[^'",]+/gc; $val =~ s/\s*$//; } /\G\s*/gc; die "Bad data\n" if ! /\G(,|$)/gc; push @row, $val; } push @data, \@row; } __END__ 'PRODUCT CODE','CATEGORY','CATEGORY DESCRIPTION','CODE DESCRIPTION','O +PTIONAL CATEGORY','OPTIONAL CATEGORY DESCRIPTION' ' ','0 ','No Item',"INVALID CODE IN USER SUPPLIED DATA",' ',' ' '00100','1 ','Cat',"ORANGE CAT",' ',' ' '82131','94 ','Dog',"GREEN DOG",' ',' ' '82132','94 ','Dog',"'JOHNS' FLYING' DOG (Start 2001)",' ',' ' '82133','94 ','Dog',"MAGENTA DOG (End 2009)",' ',' '

    Yep, not hard; worked the first try. Took a few minutes to write.

    (Update: I neglected to post-process escaped quotation marks. Of course, no provision for escaped quotation marks was given in the original problem so I just implemented the simplest version, which might not be appropriate.)

    - tye        

Re: parsing malformed CSV with per column quote chars
by dasgar (Priest) on Aug 07, 2014 at 18:19 UTC

    Just did a quick search on metacpan for "tie hash csv". It looks like Tie::Handle::CSV might be doing something along the lines of what you want. After doing a quick scan through its documentation, it looks like it shouldn't be too difficult to create a hash using this module.

    If I understand your previously used method correctly, Tie::Array::CSV can easily parse out a CSV file into an AoA structure - or as I like to think of it, a two dimensional array. It also allows you to pass parameters to Tie::File and Text::CSV, which it uses to do all of the real work.