SamCG has asked for the wisdom of the Perl Monks concerning the following question:

I'm not really sure if this is expected behavior or not, but the T::CSV::Simple module chokes on the following line
20050118184913996273,33301,EQ,A,"Hemkopskedjan Ab "B" Ordinary Shares +(Sweden)",1.0,US,NAS,USD,N,700,OBB,,NAS-OBB,HKPKF.OB, ,0,0.0,,,,,0.0, +,,00000000,00000000,00000000, , , ,,0.0,1e-06,EQ,N
Specifically, the embedded "B" gives it problems. Now, I don't trust the person who created this file to be following appropriate CSV format at all (his idea for solving the issue of commas embedded in data, for example, was to put columns like that on the end, he's apparently never heard of using double-quotes). And though I'm not intimately familiar with the CSV spec either, it doesn't seem like having a double-quoted internal value should be verboten. Should it?

Either way, I need to deal with this. I suspect I don't want to play around with the module (after all, I might want to parse other CSV files) unless I want to copy it and play with the copy, though that seems unwise architecturally. Any advice?

I suspect looking at code isn't necessary for anyone who can answer this, but just in case (and for completeness):

$parser->field_map(qw/TIMESTAMP INSTRUMENT_ID INSTRUMENT_TYPE STATUS D +ESCRIPTION CONSIDERATION_FACTOR COUNTRY_CODE EXCHANGE_ID CURRENCY_ID TRADED_IN_MI +NOR_CCY PRIMARY_BOOK SEGMENT_CODE SECTOR_ID PRIMARY_MARKET ISIN_CODE SETTLEMEN +T_TYPE SETTLEMENT_DAYS MIN_SIZE_OUTSIDE_SPREAD SETTLEMENT_EXCHANGE_ID INSTRUM +ENT_TYPE_QUALIFIER DELIVERY_MECHANISM COUNTRY_OF_INCORPORATION COUPON_RATE COUPON_DATE_1 +COUPON_DATE_2 ISSUE_DATE EXPIRY_DATE ACCRUED_START_DATE SHORT_FEBRUARY COUPON_TYPE A +CCRUED_CALC_TYPE DAY_COUNT_METHOD DENOMINATION TICK_SIZE PRODUCT_TYPE IS_RESEARCHED/); while ( my @csvdata = $parser->read_file($instfile)){ $desc{$csvdata{"INSTRUMENT_ID"}} = $csvdata{"DESCRIPTION"}; $status{$csvdata{"INSTRUMENT_ID"}} = $csvdata{"STATUS"}; $timestamp{$csvdata{"INSTRUMENT_ID"}} = $csvdata{"TIMESTAMP"}; $country_code{$csvdata{"INSTRUMENT_ID"}} = $csvdata{"COUNTRY_COD +E"}; $iisin{$csvdata{"INSTRUMENT_ID"}} = $csvdata{"ISIN_CODE"}; $icode{$csvdata{"INSTRUMENT_ID"}} = $csvdata{"INSTRUMENT_TYPE"}; $ids{$csvdata{"INSTRUMENT_ID"}}++; }
Yes, I know I could have set up my data structure better. This was basically a first pass.
Thanks,
SamCG

Replies are listed 'Best First'.
Re: Text::CSV::Simple parsing issue
by dragonchild (Archbishop) on Jan 27, 2005 at 18:12 UTC
    That is invalid CSV. The correct way would be to double up the quotes. So, instead of
    "Hemkopskedjan Ab "B" Ordinary Shares (Sweden)"
    it should be
    "Hemkopskedjan Ab ""B"" Ordinary Shares (Sweden)"
    You will want to get the person who's sending you the file to fix their stuff up. Otherwise, you're going to have a devil of a time dealing with it. And, no, it's not an easy fix.

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

Re: Text::CSV::Simple parsing issue
by jZed (Prior) on Jan 27, 2005 at 19:45 UTC
    First, since what you're doing looks suspiciously like database operations, might you be better off using DBD::CSV? It uses the same CSV engine as Text::CSV::Simple but provides a database interface for dealing with the data.

    Second, I'd concur with the other posters who point out that the data you have can't really be parsed as CSV directly, but needs to be munged first. Your best bet is to get whoever outputs the data to properly escape the embedded quotes by doubling them or putting backslash or some other escape character in front of them. If you can't get that to happen, you'll need to munge the data to put those escapes in first. Whether it's even possible to munge the data with one of the supplied regexen depends on the complexity of your data. For example, it's quite possible to have both commas and quotes embedded in a field so this is a valid CSV record of three fields:

    "a",",foo,""bar"",""baz"",","b"
    If that occurs without the escapes, it will look like this:
    "a",",foo,"bar","baz",","b"
    And I doubt there's a regex that could figure out which commas are separators and which quotes are delimiters from that.
Re: Text::CSV::Simple parsing issue
by borisz (Canon) on Jan 27, 2005 at 18:19 UTC
    A primitive, but maybe working solution is this filter:
    perl -pe 's/(?<!,)"(?!,)/""/g' <in.csv >out.csv
    try it, good luck.
    Boris
      s/(?<!,)"(?![,"])/""/g

      You don't want to screw up the doubled double-quotes that are already correctly done.

      Being right, does not endow the right to be rude; politeness costs nothing.
      Being unknowing, is not the same as being stupid.
      Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
      Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

Re: Text::CSV::Simple parsing issue
by BUU (Prior) on Jan 27, 2005 at 19:07 UTC
    Hrm. What about:
    s/(?<!,)"(?!,)/\\"/g;
    Which works at long as there are no spaces between the comma and the starting/ending quotes, since you can't add a variable length negative look behind, you'll just have to add spaces if necessary.