Re: problems parsing CSV

As you well know, there’s a “Catch-22” here. With over a million records to process, there is no practical way to hand-inspect every one. Yet your efforts to make the data “able to be processed” can also, if you are not cautious, allow incorrect data to be processed ... or allow good data to be processed incorrectly, in such a way that “you would never know.”

(It’s the quintessential “Type-1 / Type-2 Errors” principle from Stats class. In situations such as these, “incorrectly accepting ‘false’ data” is by far the more damaging, because if the computer itself does not catch the error, no one will.)

Consider writing a defensive, suspicious process that vets the data first, line by line, repairing the problems that can legitimately be repaired and throwing erroneous records out (into a separate “garbage bucket” file). If any records get thrown out in this way, stop and scream. Otherwise, process the records ... using CPAN modules to your best advantage as you are able.
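To make the idea concrete, here is a rough sketch of such a vetting pass. It is written in Python purely for illustration; the same shape should carry over to Perl with a CSV parser from CPAN such as Text::CSV. Everything specific in it is an assumption of mine, not something from your data: the three-column layout (id, name, amount), the particular checks, and the names vet_record, vet_stream and vet_or_die are all hypothetical. Substitute whatever rules your records actually have to satisfy.

```python
import csv
import io

EXPECTED_FIELDS = 3  # assumption: a valid record is (id, name, amount)


def vet_record(fields):
    """Return (cleaned_fields, None) if acceptable, else (None, reason)."""
    if len(fields) != EXPECTED_FIELDS:
        return None, f"expected {EXPECTED_FIELDS} fields, got {len(fields)}"
    # The one problem this sketch is willing to repair: stray whitespace.
    cleaned = [f.strip() for f in fields]
    if not cleaned[0].isdigit():
        return None, f"id is not numeric: {cleaned[0]!r}"
    if cleaned[1] == "":
        return None, "name is empty"
    try:
        float(cleaned[2])
    except ValueError:
        return None, f"amount is not a number: {cleaned[2]!r}"
    return cleaned, None


def vet_stream(source, good_out, garbage_out):
    """Vet every record; rejects go to garbage_out with line number and reason.

    Returns the number of rejected records.
    """
    reader = csv.reader(source)
    good_writer = csv.writer(good_out)
    garbage_writer = csv.writer(garbage_out)
    rejected = 0
    for fields in reader:
        cleaned, reason = vet_record(fields)
        if reason is None:
            good_writer.writerow(cleaned)
        else:
            rejected += 1
            garbage_writer.writerow([reader.line_num, reason] + fields)
    return rejected


def vet_or_die(source, good_out, garbage_out):
    """'Stop and scream': refuse to continue if anything was thrown out."""
    rejected = vet_stream(source, good_out, garbage_out)
    if rejected:
        raise SystemExit(
            f"{rejected} record(s) rejected; inspect the garbage file first"
        )
```

In real use the three streams would be files opened with newline="" (as the Python csv documentation recommends), and the actual processing would run only after vet_or_die returns quietly. Note that the garbage file keeps the line number and the reason alongside the original fields, so a million-record run that fails tells you exactly which handful of records to look at by hand.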

Like all contributed code, CPAN modules are (to some degree) “designed for the general case.” Sometimes that is a good thing, and sometimes it does not work out so well. As a last resort (so to speak), a CPAN module can still serve well as “a source of inspiration” for code of your own. And that is a legitimate purpose, too.