I have two problems with the above regex. The first is that everything except the commas is optional.

#!/usr/bin/perl

$_ = ",,,,";
print "Good\n" if /(?:[^,]*,\s*){3}(.*?)\s*,/;

The above code will print "Good\n". If you've worked with CSV data for any length of time you know that the odds are good that sooner or later you'll get a line with all commas (at least I have). Depending upon what is done with the data, you could have serious data corruption.
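One way to guard against the all-commas line is to require at least one non-comma character in each field. The following is only a sketch under my own assumptions: the helper name fourth_field is made up, the pattern is anchored to the start of the line, and empty fields are treated as invalid, which may not suit your data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# fourth_field() is a hypothetical helper, not part of the original regex.
# Assumptions: the match is anchored at the start of the line, and each of
# the first four fields must hold at least one non-comma character.
sub fourth_field {
    my ($line) = @_;
    return $line =~ /^(?:[^,]+,\s*){3}([^,]+?)\s*,/ ? $1 : undef;
}

for my $line (",,,,", "a, b, c, d, e") {
    my $field = fourth_field($line);
    print defined $field
        ? "Good: '$field'\n"
        : "Rejected: '$line'\n";
}
```

The all-commas line is now rejected and the second line yields d. Note that a field holding only whitespace would still pass this check, so tighten the character class further if that matters.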

Second, I would avoid the (.*?)\s*, construct. It says nothing about what a field may contain and can cause problems. ([^,]+)\s*, states that a field is a run of non-comma characters and is more appropriate. In fact, if you know that the data you are capturing won't have any embedded spaces or tabs (and I'm assuming that everything is on one line), you can use ([^, \t]+),.
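To see how the three capture forms differ, here is a small comparison. It is a sketch only: the helper first_capture and the sample string are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# first_capture() is a hypothetical helper used only for this comparison.
# It returns $1 if the regex matches, or undef if it does not.
sub first_capture {
    my ($regex, $text) = @_;
    return $text =~ $regex ? $1 : undef;
}

my $text = "alpha , beta";    # made-up sample with a space before the comma

for my $case (
    [ 'lazy dot'          => qr/(.*?)\s*,/    ],
    [ 'negated class'     => qr/([^,]+)\s*,/  ],
    [ 'no spaces or tabs' => qr/([^, \t]+),/  ],
) {
    my ($label, $regex) = @$case;
    my $got = first_capture($regex, $text);
    printf "%-18s %s\n", $label, defined $got ? "[$got]" : "no match";
}
```

On this input the lazy dot captures alpha, the greedy negated class captures alpha with its trailing space (so use [^,]+? or trim afterwards if that matters), and the last pattern does not match at all because a space sits between the field and the comma.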

In this case, I don't think the way your regex is crafted will cause a problem, but subtle errors can creep in down the road as maintenance occurs. Your regex works because the whitespace that follows the capture is optional, but the negated character class is almost always preferable because it states exactly what you want.

Consider the following problem: you want to print the first field of comma-delimited text if the last character prior to the comma is a sharp (#), but you don't want to capture the sharp. If the data doesn't fit this format, you want the regex to fail completely. The following regex looks fine at first glance:

print "$1\n" if /^(.+?)#,/;

It is, however, a bad choice. The negated character class is the right tool here:

#!/usr/bin/perl

$_ = "test1, test2#,test3";
print "$1\n" if /^(.+?)#,/;   # Returns a false positive
print "$1\n" if /^([^,]+)#,/; # This fails, as we expect

The first regex above will print test1, test2, because .+? is free to match the comma after test1 and keep going until it finds #,. I'm not trying to sound picky, but any time I see .* or .+ used in a regex, I look for a way to remove it because it's not terribly precise.

In reply to (Ovid) RE: Re: Regexp glitch while parsing record in CSV by Ovid
in thread Regexp glitch while parsing record in CSV by greenhorn
