Regexp nightmare with CSV

Pingu has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a small script for my running club's site to allow searchable race results.

The data is stored in a CSV format file, and I split(/,/) each line on commas into an array. This is all great, except that some of the data fields contain commas which I want to keep.

The file looks something like this:

1,"FirstName","Surname","Running club, Country",hh:mm:ss
2,"etc.","etc.","Different club, same country",hh:mm:ss

The commas in the 4th field are to be kept, not split on.

I've tried:

while (<FP>) {
    s/"(.+?),(.+?)"/g;
    (@row) = split(/,/);
}

but it doesn't work - it picks up the wrong commas. Can anyone help please?

I have a feeling that I need a non-backtracking pattern but I can't suss it.

Thanks folks,

Pingu

Edited 2001-05-28 by Ovid


Re: Regexp nightmare by petdance (Parson) on May 28, 2001 at 20:13 UTC
You want Text::CSV_XS. It works wonderfully, and is as flexible as you could want. Embedded carriage returns? Comma separators that are actually pipes? No problem.

(And, as an aside, this question just reinforces my thoughts about needing a corollary to TMTOWTDI.)

xoxo,
Andy

%_=split/;/,".;;n;u;e;ot;t;her;c; ". # Andy Lester
'Perl ;@; a;a;j;m;er;y;t;p;n;d;s;o;'. # http://petdance.com
"hack";print map delete$_{$_},split//,q< andy@petdance.com >
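A minimal sketch of the Text::CSV_XS suggestion, applied to the two sample rows from the question. It assumes Text::CSV_XS is installed from CPAN; the variable names and the `binary => 1` option are choices made for this illustration and are not specified anywhere in the thread.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# binary => 1 is an assumption for this sketch; it lets fields hold
# arbitrary bytes such as embedded line breaks.
my $csv = Text::CSV_XS->new({ binary => 1 })
    or die "Cannot create Text::CSV_XS parser";

# The two sample rows from the question
my @lines = (
    '1,"FirstName","Surname","Running club, Country",hh:mm:ss',
    '2,"etc.","etc.","Different club, same country",hh:mm:ss',
);

my @rows;
for my $line (@lines) {
    # parse() returns false when the line is not well-formed CSV
    $csv->parse($line)
        or die "Error parsing: " . $csv->error_input . "\n";
    push @rows, [ $csv->fields ];
}

# The quotes are stripped and the comma inside field 4 survives
print join(' | ', @$_), "\n" for @rows;
```

Each element of `@rows` is an array reference holding the five fields of one line, so a search script could compare against, say, `$rows[0][3]` without worrying about the comma in the club name.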
Re: Regexp nightmare by Coyote (Deacon) on May 28, 2001 at 20:14 UTC
I would recommend checking out one of the CSV modules on CPAN rather than rolling your own. Possible candidates:

Text::CSV
Text::CSV_XS
DBI and DBD::CSV

----
Coyote
Re (tilly) 2: Regexp nightmare by tilly (Archbishop) on May 28, 2001 at 21:05 UTC
Text::CSV cannot handle embedded returns, nor is its API consistent with handling them. For a pure Perl solution that does handle embedded returns correctly you can try Text::xSV.
Re: Re (tilly) 2: Regexp nightmare by shotgunefx (Parson) on May 29, 2001 at 00:29 UTC
Do you mean CR or CRLF in the fields? The way I always get around it with Text::CSV_XS is to treat the file like an MS-DOS/Win32 text file.

# Code that writes CSV out: turn any CRLF inside the data into a bare CR,
# then terminate the record with CRLF.
$csvstring =~ s/\cM\cJ/\cM/g;
print SH $csvstring . "\cM\cJ";

# Code that reads and parses CSV.
{
    local $/ = "\cM\cJ";    # end of line is now \cM\cJ
    while (my $line = <INFILE>) {
        if ($csv->parse($line)) {
            my @columns = $csv->fields;
            # Process data here
        } else {
            # A method call does not interpolate inside a string
            die "Error parsing: " . $csv->error_input . "\n";
        }
    }
}

-Lee

"To be civilized is to deny one's nature."
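For comparison, here is a sketch of a different route to the same goal: instead of changing `$/`, let Text::CSV_XS read the records itself with `getline()`, which keeps pulling physical lines until a quoted field is closed. This is not the technique described in the reply above. The sample data, the in-memory filehandle, and the variable names are hypothetical, and the sketch assumes the parser is created with `binary => 1`.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Hypothetical sample data: the club field of record 1 spans two lines
my $data = qq{1,"First","Surname","Running club,\nCountry",01:02:03\n}
         . qq{2,"Other","Name","Different club, same country",01:05:00\n};

# Read from an in-memory filehandle so the sketch needs no file on disk
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

# binary => 1 allows line breaks inside quoted fields
my $csv = Text::CSV_XS->new({ binary => 1 })
    or die "Cannot create Text::CSV_XS parser";

my @records;
# getline() returns an array reference per record, undef at end of input
while (my $row = $csv->getline($fh)) {
    push @records, $row;
}
close $fh;

printf "%d records read\n", scalar @records;
```

The three physical lines of input come back as two records, with the line break preserved inside the fourth field of the first one.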
Re (tilly) 4: Regexp nightmare by tilly (Archbishop) on May 29, 2001 at 06:31 UTC
Re: Re (tilly) 4: Regexp nightmare by shotgunefx (Parson) on May 29, 2001 at 11:47 UTC
Re: Re (tilly) 2: Regexp nightmare by Anonymous Monk on May 31, 2001 at 16:47 UTC
Lovely - does exactly what it says on the tin. I particularly like bind_header() and the ability to extract only those fields you require. Thank you for that, you have solved my problem.

Pingu (logged in at work and can't remember my p/word)
Re: Regexp nightmare by JP Sama (Hermit) on May 28, 2001 at 20:24 UTC
I think you could just abandon the CSV file and use TAB (\t) as your delimiter. Please check THIS NODE, by BBQ, for more information!

#!/jpsama/bin/perl -w
$tks = `mount`;
$jpsama = $! if $!;
print $jpsama;
Re: Regexp nightmare with CSV by larryk (Friar) on May 29, 2001 at 00:29 UTC
If your data is the same (_always_) then you can use a specific regex to get the data out. Or, perhaps more appropriately, to modify your delimiters:

Case 1 - permanent regex:

for my $line (@lines_from_data_file) {
    my ($idx, $fname, $sname, $loc, $time) =
        $line =~ /^(\d+),("[^"]+"),("[^"]+"),("[^"]+"),(.*)$/;
}

Case 2 - one-liner to modify delimiters (in place):

perl -i.bak -ne "s/([\d\x22]),/$1.'\|'/eg;print" datafile
# for some reason I can't use single quotes for a
# perl -e on Win32 so I have to use \x22 for "

Case 3 - just realised you can use the regex above (slight mod.) for your split:

# lookbehind, so the digit or quote before the comma stays in the field
@data = split /(?<=[\d"]),/;

I still suggest that case 2 is your best option - you're just making more work for yourself if you don't.

"Argument is futile - you will be ignorralated!"
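A worked sketch of the regex approach on the two sample rows from the question, using only core Perl. It runs the Case 1 pattern and, next to it, a lookbehind form of the split that breaks only on a comma directly following a digit or a closing quote. The `@records` structure and its key names are hypothetical, introduced for this illustration. Both techniques rely on every line having exactly this shape.

```perl
use strict;
use warnings;

# The two sample rows from the question
my @lines = (
    '1,"FirstName","Surname","Running club, Country",hh:mm:ss',
    '2,"etc.","etc.","Different club, same country",hh:mm:ss',
);

my @records;
for my $line (@lines) {
    # Case 1 style: one capture group per column; the quotes stay
    # inside the captured values
    my ($idx, $fname, $sname, $loc, $time) =
        $line =~ /^(\d+),("[^"]+"),("[^"]+"),("[^"]+"),(.*)$/
        or die "Unexpected line format: $line\n";

    # Lookbehind split: the comma after "club" follows a letter,
    # so it is not treated as a separator
    my @data = split /(?<=[\d"]),/, $line;

    push @records, {
        regex => [ $idx, $fname, $sname, $loc, $time ],
        split => \@data,
    };
}

print "$_->{regex}[3]\n" for @records;
```

Note that the split form would misfire on a field such as "Club 9, Country", where the inner comma follows a digit, which is one reason the module-based answers in this thread are the safer choice for data you do not control.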
Re: Regexp nightmare by Pingu (Sexton) on May 28, 2001 at 20:14 UTC
Arrgh, formatting hassle. Apologies for the, err, missing bits... maybe if I type:

while (<FP>) {
    s/"(.+?),(.+?)"/\1===\2/g;
    (@row) = split(/,/);
    foreach (@row) {
        s/===/,/g;
    }
}

it might be clearer.... or perhaps not.
Re: Regexp nightmare by Pingu (Sexton) on May 28, 2001 at 20:17 UTC
Holy regexp, 2 replies before I'd even got the question right! Thanks a million! Pingu