in reply to regular expression (search and destroy)

You have already had all the obligatory warnings about using a module, or at least copying from one, rather than rolling your own. Let me be the one to caution you that if you do opt for a module, look at the candidates very carefully. They are not all equal.

The first thing to check is that the module's idea of what constitutes CSV data is the same as Excel's. For example, Excel can generate CSV data with quoted fields that contain embedded newlines. And don't blame MS for this extension to the standard (if you can find a standard definition for CSV); many other spreadsheets also do this, going right back to the once ubiquitous Lotus 1-2-3, I believe. To date, tilly's Text::xSV is the only module I have found that will handle this.

If you have large volumes of CSV to parse, be aware that many of the CSV modules around are less than sparkling in the performance department. The best performer I have found is Text::CSV_XS, but it fails to handle embedded newlines. In any case, it is an XS module, so if you cannot or will not install modules, it will not be useful to you.

It is possible to do this yourself with regexes, but it is quite difficult to get it right and cover all the edge cases.
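To give a feel for the edge cases, here is a minimal sketch of the simplest thing that goes wrong. The sample line is made up for illustration; it has three fields, one of which is quoted and contains a comma.

```perl
use strict;
use warnings;

# Hypothetical sample record: three fields, the second one quoted
# because it contains a comma.
my $line = q{1,"Smith, John",42};

# Naive approach: split on every comma, paying no attention to quotes.
my @naive = split /,/, $line;

# The quoted field is cut in two, so we get four pieces instead of three.
print scalar(@naive), " pieces from a 3-field record\n";
print "[$_]\n" for @naive;
```

Embedded quotes, embedded newlines and empty trailing fields each need their own handling on top of this, which is where hand-rolled regexes tend to come unstuck.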


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail

Re: Re: regular expression (search and destroy)
by giulienk (Curate) on Nov 13, 2003 at 07:32 UTC
    Text::CSV_XS does indeed handle embedded newlines; you just have to configure your new object properly using the binary option.

    I quote from the man page:

    binary
    
        If this attribute is TRUE, you may use binary characters in quoted fields,
        including line feeds, carriage returns and NUL bytes. (The latter must
        be escaped as "0.) By default this feature is off.
    

    I find Text::CSV_XS to be a nice solution; I never had problems once I set up the object attributes correctly. Performance-wise it is lightning fast, especially using the print and getline methods.
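    Something along these lines should work as a rough sketch, assuming Text::CSV_XS is installed. The sample data and variable names are made up for illustration, and it reads from an in-memory filehandle only so that no external file is needed.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Hypothetical sample data: the second field of the first record
# contains an embedded newline inside the quotes.
my $data = qq{id,"line one\nline two",end\n2,plain,end\n};

# binary => 1 allows line feeds (and other binary characters)
# inside quoted fields.
my $csv = Text::CSV_XS->new({ binary => 1 })
    or die "Cannot create parser: " . Text::CSV_XS->error_diag;

# Open the string as a filehandle so getline can read from it.
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

# getline reads one logical record at a time, which may span
# several physical lines when a quoted field contains a newline.
my @rows;
while (my $row = $csv->getline($fh)) {
    push @rows, $row;
}
close $fh;

printf "%d records, first record has %d fields\n",
    scalar @rows, scalar @{ $rows[0] };
```

    The point to notice is that you hand getline the filehandle rather than reading lines yourself; if you read line by line and pass each one to the parser, a record with an embedded newline arrives already chopped in half.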


    $|=$_="1g2i1u1l2i4e2n0k",map{print"\7",chop;select$,,$,,$,,$_/7}m{..}g