in reply to Re^2: problems parsing CSV
in thread problems parsing CSV

Removing diacritics from characters alters the data you're parsing. Are you supposed to do that? Probably not. You certainly shouldn't have to modify the data for any reason.

What character encoding is the text in? ISO 8859-1 (Latin 1)? Windows-1252? UTF-8? If you don't know, I encourage you to find out. You really ought to know.
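
If you aren't sure where to start, one rough check is to see whether the raw bytes decode cleanly as strict UTF-8. Here is a minimal sketch using the core Encode module. The helper name looks_like_utf8 and the sample strings are mine, made up for illustration. A failed decode only tells you the data is not UTF-8; it cannot tell ISO 8859-1 from Windows-1252, and plain ASCII passes the check too.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns 1 if the byte string is well-formed UTF-8, 0 otherwise.
# Encode::FB_CROAK makes decode() die on malformed input instead of substituting.
sub looks_like_utf8 {
    my ($bytes) = @_;    # local copy, so decode() may consume it safely
    my $ok = eval { Encode::decode('UTF-8', $bytes, Encode::FB_CROAK); 1 };
    return $ok ? 1 : 0;
}

# Made-up samples: the same word in two different encodings.
my %samples = (
    'latin-1 bytes' => "caf\xE9",        # e-acute as a single byte
    'utf-8 bytes'   => "caf\xC3\xA9",    # e-acute as two bytes
);

for my $name (sort keys %samples) {
    printf "%s: %s\n", $name,
        looks_like_utf8($samples{$name}) ? 'valid UTF-8' : 'not UTF-8';
}
```

In practice you would read a chunk of the file in raw mode (open with '<:raw') and feed that to the helper rather than using literal strings.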

I suspect both of your problems (malformed CSV records and text that isn't in the ASCII character encoding) can be solved by using Text::CSV::Encoded and enabling allow_loose_quotes, as others have recommended.

UPDATE: Using Text::CSV::Encoded may be overkill. Jenda's recommendation to set the binary attribute to true (1) may be all you really need. Even so, I still believe you ought to know what character encoding the text is in.
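
For what it's worth, here is a minimal sketch of the two Text::CSV attributes mentioned above used together. The sample record and the parse_line helper are made up for illustration, not taken from your data, so adjust to taste.

```perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({
    binary             => 1,    # accept bytes outside plain ASCII (e.g. diacritics)
    allow_loose_quotes => 1,    # tolerate stray quotes inside unquoted fields
}) or die "Cannot use Text::CSV: " . Text::CSV->error_diag;

# Hypothetical helper: parse one line and return its fields as an array ref.
sub parse_line {
    my ($line) = @_;
    $csv->parse($line) or die "Parse failed: " . $csv->error_diag;
    return [ $csv->fields ];
}

# Made-up record: a Latin-1 byte plus loose quotes in the second field.
my $fields = parse_line(qq{1,caf\xE9 "au lait" blend,0.0});
print scalar(@$fields), " fields parsed\n";
```

Note that binary => 1 only makes the parser accept the bytes as they are; it does not decode them into characters, which is why knowing the encoding still matters.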

Re^4: problems parsing CSV
by helenwoodson (Acolyte) on Oct 10, 2010 at 10:56 UTC

    Good point. It would be better to avoid removing the diacritics if possible. I did find that setting the binary attribute to true for Text::CSV did prevent the script from choking on the diacritics.

    I do not know the character encoding and don't know how to identify it, so I asked Mr. Google (Mr. Google knows all!) and am looking through what he dredged up. I looked at the documentation on CPAN for Text::CSV::Encoded. It appears that, in order to use it, you need to know the encoding of the input and what encoding you want for the output.

    I seem to have it working fairly well except for some of the cases where the weight is "" or 0.0. I haven't yet figured out why it works correctly for some records and not for others. I will look at the records where it fails and see if I can identify that.

    Thanks very much.