in reply to Regexp glitch while parsing record in CSV

Looks to me like your regexp is saying:
    ... \s*([^,]+)\s*,...

    Find some whitespace, or perhaps none at all.
    Find anything that is not a comma, at least once, which *includes* whitespace, then...
    Find some more whitespace, perhaps none.
    Find a comma.

Since Perl's quantifiers are greedy, [^,]+ takes the trailing whitespace into the parens, and leaves nothing for the \s* afterwards to match.
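A minimal sketch of that behaviour, assuming a made-up sample record (the string and the helper name capture_first are only for illustration, they are not from the original question):

```perl
use strict;
use warnings;

# Hypothetical helper: return what the original pattern captures
# for the first field of a record, or undef if it does not match.
sub capture_first {
    my ($record) = @_;
    return $record =~ /^\s*([^,]+)\s*,/ ? $1 : undef;
}

# Sample data invented for illustration.
my $got = capture_first("  alpha  , beta");

# The brackets make the captured trailing spaces visible.
print "[$got]\n";
```

The leading \s* gets first pick at the leading spaces, but the trailing ones are already inside the capture by the time the second \s* is tried.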

I came up with this:

m#(?:[^,]*,\s*){3}(.*?)\s*,#
which seems to do the trick. Breaking it down:
    (?:        ## Group, but do not store it into $1
      [^,]*    ## Anything that is not a comma
      ,        ## Followed by a comma
      \s*      ## Followed by possible whitespace
    ){3}       ## Find three of these (the first three)
    (.*?)      ## Match any character, but don't be so greedy about it
    \s*        ## Possible whitespace
    ,          ## Stops at first comma, because we are not being greedy
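Here is the same pattern as a small runnable sketch, written with the /x modifier so the pieces can be spaced out. The helper name fourth_field and the sample record are assumptions made up for the example:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the pattern from this post.
# Returns the fourth comma-separated field, or undef if there is no match.
sub fourth_field {
    my ($record) = @_;
    return $record =~ m{
        (?: [^,]* , \s* ){3}   # skip the first three fields
        (.*?)                  # capture the fourth, non-greedily
        \s* ,                  # optional whitespace, then the next comma
    }x ? $1 : undef;
}

# Sample data invented for illustration.
my $field = fourth_field("a, b ,c,  delta  , e");
print "[$field]\n";
```

Note that the pattern needs a comma after the fourth field, so a record with exactly four fields would not match as written.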

There is probably a better way to write it, but I'm tired and this seems to work.

RE: Re: Regexp glitch while parsing record in CSV
by greenhorn (Sexton) on Jul 17, 2000 at 10:10 UTC
    > m#(?:[^,]*,\s*){3}(.*?)\s*,#

    This worked, thanks--and taught me a couple of things about regular expressions that I'd seen in several of the books but hadn't yet understood.

    I'm still perplexed by the failure of the other regexp. Even though [^,]+ does indeed include spaces, I don't see why the \s* preceding and following it fail to catch spaces when they exist in those locations in the string. \s* does catch them when it is used this way:

    split /\s*,\s*/, $_

    Thanks again.

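      The asterisk can be a little tricky. It means "match zero or more of the preceding", so \s* is satisfied even when it matches nothing at all. In a construct like \s*[^,]+\s*, the character class matches the spaces too (anything except the comma), and the + is greedy, so it takes the trailing spaces before the final \s* gets a chance at them. In the split, nothing else in the pattern competes for that whitespace.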
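A rough side-by-side sketch of the two cases. The sample record and both helper names are invented for the example, and the lazy +? is just one possible adjustment, not something proposed earlier in the thread:

```perl
use strict;
use warnings;

# Hypothetical helper: in split, \s* sits right next to the comma and
# nothing else in the pattern wants that whitespace, so it is discarded
# as part of the separator.
sub split_fields {
    my ($record) = @_;
    return split /\s*,\s*/, $record;
}

# Hypothetical helper: making the + lazy means the capture stops as early
# as it can, which leaves the trailing whitespace for \s* to match.
sub first_field_trimmed {
    my ($record) = @_;
    return $record =~ /^\s*([^,]+?)\s*,/ ? $1 : undef;
}

my $record = "one , two,three ,  four";   # invented sample
print join("|", split_fields($record)), "\n";
my $first = first_field_trimmed($record);
print "[$first]\n";
```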
(Ovid) RE: Re: Regexp glitch while parsing record in CSV
by Ovid (Cardinal) on Jul 17, 2000 at 20:48 UTC
    I have two problems with the above regex. The first is that everything except the commas is optional.
    #!/usr/bin/perl
    $_ = ",,,,";
    print "Good\n" if /(?:[^,]*,\s*){3}(.*?)\s*,/;
    The above code will print "Good\n". If you've worked with CSV data for any length of time you know that the odds are good that sooner or later you'll get a line with all commas (at least I have). Depending upon what is done with the data, you could have serious data corruption.
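One possible way to tighten this, sketched under the assumption that an empty field should make the whole record invalid (that may not hold for your data, since empty fields are often legitimate in CSV). The helper name strict_fourth_field and the sample records are made up for the example:

```perl
use strict;
use warnings;

# Hypothetical helper: same idea as the pattern above, but anchored at the
# start, and with every field required to hold at least one non-comma
# character ([^,]+ instead of [^,]*).
sub strict_fourth_field {
    my ($record) = @_;
    return $record =~ /^(?:[^,]+,\s*){3}([^,]+?)\s*,/ ? $1 : undef;
}

# Invented samples: an all-commas line and an ordinary record.
for my $record (",,,,", "a, b ,c,  delta  , e") {
    my $field = strict_fourth_field($record);
    print defined $field ? "matched [$field]\n" : "rejected [$record]\n";
}
```

A field holding only whitespace still counts as non-empty here, so this is a guard against the all-commas line rather than full validation.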

    Also, I would try to avoid the (.*?)\s*, construct. It's not terribly specific and can cause problems. ([^,]+)\s*, is very specific and is more appropriate. In fact, if you know that the data you are capturing won't have any embedded spaces or tabs (and I'm assuming that everything is on one line), you can use ([^, \t]+),.

    In this case, I don't feel that it will cause a problem with how your regex is crafted, but subtle errors can creep in down the road as maintenance occurs. Your regex is fine because the whitespace behind it is optional, but the negated character class is almost always preferable because it states exactly what you want.

    Consider the following problem: you want to print the first field of comma-delimited text if the last character prior to the comma is a sharp (#), but you don't want to capture the sharp. If the data doesn't fit this format, you want the regex to fail completely. The following regex looks fine at first glance:

    print "$1\n" if /^(.+?)#,/;
    It is, however, a bad choice. The negated character class is proper:
    #!/usr/bin/perl
    $_ = "test1, test2#,test3";
    print "$1\n" if /^(.+?)#,/;     # Returns a false positive
    print "$1\n" if /^([^,]+)#,/;   # This fails, as we expect
    The first regex above will print "test1, test2". I'm not trying to sound picky, but any time I see .* or .+ used in a regex, I look for a way to remove it because it's not terribly precise.