Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm reading a CSV file for it's values, which is fairly simple, but I need to allow for commas to be passed through as data as well and therefore allow escaping. So for example I want: foo,bar,foo\,bar to be read as 3 values and not 4. At first I thought I could accomplish this by using a negative look behind assertion, but then I realized that I had problems if my values contained legitimate back slashes (which must also now be escaped.

Please do not recommend a perl module for this task, as this is just as much about the technique as it is the task.

Thanks in advance!

Replies are listed 'Best First'.
Re: Reading escaped data from a CSV
by jeffa (Bishop) on Nov 14, 2003 at 20:39 UTC
    You should use quotes:
    foo,bar,"foo,bar"
    to differentiate, not an escaped comma. (if you would look at the source for Text::CSV or Text::xSV you'd get some ideas)

    UPDATE:
    note to $self -- stop recommending Text::CSV and start recommending Text::CSV_XS (which is what i use anyway) :/

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
Re: Reading escaped data from a CSV
by Ovid (Cardinal) on Nov 14, 2003 at 20:42 UTC

    Anonymous Monk wrote: Please do not recommend a perl module for this task, as this is just as much about the technique as it is the task.

    Well, if you want to get this right, don't forget to include proper handling of newlines in CSV fields (Text::CSV gets this wrong, last I checked). Also, what if the escape is escaped? You'll want to deal with that. And what if you have "quoted commas, don't split".

    In other words, that could be a fair amount of "technique" to get right, or you could just use a good module, which is a better "technique". Regular expressions are great, but they tend to be overused.

    Cheers,
    Ovid

    New address of my CGI Course.

      >>Well, if you want to get this right, don't forget to include proper handling of newlines in CSV fields (Text::CSV gets this wrong, last I checked).

      Text::CSV never attempted to or claimed to deal with embedded newlines. Nor is it a maintained module. It was superceded by Text::CSV_XS long ago. Text::CSV_XS (as oppossed to Text::CSV) handles newlines if the flag binary=1 is passed. Modules like DBD::CSV which use Text::CSV_XS set that by default and therefore support embedded newlines by default.

      And, not to nitpick, but I think "wrong" is a bit harsh. There is no standard definition of CSV so there's no standard to be wrong against. But certainly it's more limited in what it can handle, one of the reasons that Text::CSV_XS came into existence.

        Most people consider Microsoft Office compatibility to be enough of a standard that that is what they shoot for.
Re: Reading escaped data from a CSV
by Cody Pendant (Prior) on Nov 14, 2003 at 21:05 UTC
    Please do not recommend a perl module for this task, as this is just as much about the technique as it is the task.

    I think you'd get better responses if you stopped saying "this is CSV and I want to deal with it" and just said "what regular expression/script can I use to split on commas, except between quotes" and so on.

    Everyone wants to say "use a module" to deal with CSV because all these problems have long ago been fixed. And if you really want to understand the way to script the reading of CSV, why don't you at least read the module?



    ($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss') =~y~b-v~a-z~s; print
Re: Reading escaped data from a CSV
by jdtoronto (Prior) on Nov 14, 2003 at 21:09 UTC
    I would second the motion to use Text::CSV_XS For the last four or five years I have been handling CSV files all day in a huge variety of variants. For some time I ysed a homme-grown module, but it was never quite the success that this one has been! If you want to do it yourself, have a look at the source code, I am sure you will see a few possible gotcha's you have never thought of!

    jdtoronto

Re: Reading escaped data from a CSV
by jZed (Prior) on Nov 14, 2003 at 22:43 UTC
    While Text::CSV_XS is best for production use, it won't do you much good as an example of parsing CSV unless you want to look at the C code in it. For pure perl, you might want to look at AnyData::Format::CSV which uses a modifed Friedl regex to support variable delimiters, escape characters, etc. But note: it does not (yet) support embedded newlines unless something other than a single newline is used as a record separator.
      A better pure Perl example to get ideas on how to parse is probably Text::xSV. It does handle embedded newlines. Plus the strategy of using /g matches and pos is massively easier to get your head around than big and nasty regular expressions. (And extends to more complex problems.)
Re: Reading escaped data from a CSV
by hanenkamp (Pilgrim) on Nov 14, 2003 at 22:40 UTC

    First my rant: No offense monks, but some of the answers I've read to this question are kind of pathetic. I don't care whether or not there is a module or not, but sometimes it is helpful to know the technique. Sheesh. Feel free to make comments about using modules, but at least give him/her some direction besides, "No one does this anymore, just use the module or forget it!" Reinventing the wheel is a time-honored process for learning. Nice way to be rude to a potential JAPH.

    Anonymous, my first suggestion is to browse CPAN for the modules that have been mentioned already by others. At http://search.cpan.org/ you can view the documentation of modules and then click on the "Source" links to view the actual source. Then, you can see how the module author has done it.

    If you are interested in really doing this yourself, which I, like the others, would not recommend. You can use a negative lookup in a split to use escaped commas:

    my @fields = split /(?<!\\),/, $line;

    Another solution is to use a character which you do not allow in your strings, which is why characters like ^ or ~ or | are often used as delimiters in CSV files.

    The most common solution, however, is to use quotes around strings to escape an entire string. That is, commas are only available for splitting outside of quotes. This is difficult enough that I don't, off-hand, have a simple solution, though I've seen a few...for this, I would definitely suggest browsing the source of a pertinent module.