Regexp glitch while parsing record in CSV

greenhorn has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Regexp glitch while parsing record in CSV by turnstep (Parson) on Jul 17, 2000 at 04:39 UTC
Looks to me like your regexp is saying: `... \s([^,]+)\s,... Find some whitespace, or perhaps none at all. Find anything that is not a comma, at least once, which includes whitespace, then... Find some more whitespace, perhaps none Find a comma` [download] Since perl is greedy, it takes the whitespace into the parens, and leaves nothing for the \s* afterwards to match. I came up with this: `m#(?:[^,],\s){3}(.?)\s,#` [download] which seems to do the trick. Breaking it down: `(?: ## Group, but do not store it into $1 [^,]* ## Anything that is not a comma , ## Followed by a comma \s* ## Followed by possible whitespace ){3} ## Find three of these (the first three) (.?) ## Match any character, but don't be so greedy about it \s ## Possible whitespace , ## Stops at first comma, because we are not being greedy` [download] There is probably a better way to write it, but I'm tired and this seems to work.	[reply] [d/l] [select]
RE: Re: Regexp glitch while parsing record in CSV by greenhorn (Sexton) on Jul 17, 2000 at 10:10 UTC
<kbd>> m#(?:[^,],\s){3}(.?)\s,#</kbd> This worked, thanks--and taught me a couple of things about regular expressions that I'd seen in several of the books but hadn't yet understood. Still perplexed by the failure of the other regexp. Even though<kbd> [^,]+ </kbd>does indeed include spaces, I'm perplexed by why the<kbd> \s* </kbd>preceding and following it fail to catch spaces when they exist in those locations in the string.<kbd> \s* </kbd>does catch them when it is used this way: <kbd>split /\s,\s/ , $_</kbd> Thanks again.	[reply]
(chromatic) RE: RE: Re: Regexp glitch while parsing record in CSV by chromatic (Archbishop) on Jul 18, 2000 at 03:20 UTC
The asterisk can be a little tricky. It means, match zero or more of the preceding. In a construct like `\s[^,]+\s`, the character class does match the spaces and the comma. The + is greedy, too.	[reply] [d/l]
(Ovid) RE: Re: Regexp glitch while parsing record in CSV by Ovid (Cardinal) on Jul 17, 2000 at 20:48 UTC
I have two problems with the above regex. This first is that everything except the commas is optional. `#!/usr/bin/perl $_ = ",,,,"; print "Good\n" if /(?:[^,],\s){3}(.?)\s,/;` [download] The above code will print "Good\n". If you've worked with CSV data for any length of time you know that the odds are good that sooner or later you'll get a line with all commas (at least I have). Depending upon what is done with the data, you could have serious data corruption. Also, I would try to avoid the `(.?)\s,` construct. It's not terribly specific and can cause problems. `([^,]+)\s,` is very specific and is more appropriate. In fact, if you know that the data you are capturing won't have any embedded spaces or tabs (and I'm assuming that everything is on one line), you can use `([^, \t]+),`. In this case, I don't feel that it will cause a problem with how your regex is crafted, but subtle errors can creep in down the road as maintenance occurs. Your regex is fine because the whitespace behind it is optional, but the negated character class is almost always preferable because it states exactly what you want. Consider the following problem: you want to print the first field of comma-delimited text if the last character prior to the comma is a sharp (#), but you don't want to capture the sharp. If the data doesn't fit this format, you want the regex to fail completely. The following regex looks fine at first glance: `print "$1\n" if /^(.+?)#,/;` [download] It is, however, a bad choice. The negated character class is proper: `#!/usr/bin/perl $_ = "test1, test2#,test3"; print "$1\n" if /^(.+?)#,/; # Returns a false positive print "$1\n" if /^([^,]+)#,/; # This fails, as we expect` [download] The first regex above will print `test1, test2`. I'm not trying to sound picky, but any time I see the . or .+ used in a regex, I always look for a way to remove it because it's not terribly precise.	[reply] [d/l] [select]
Re: Regexp glitch while parsing record in CSV by Shoeboy (Sexton) on Jul 17, 2000 at 07:37 UTC
Have you considered optimizing split a little bit. Your regex finds 4 commas and returns whatever's between the 3rd and 4th. Your split finds 11 commas and returns a list of 12 items. It's no suprise that the regex is faster. If you're only trying to return a part of a string, a regex seems to be more appropriate than split. Still, if you like the idea of using split (it is shorter and more readable) -- try limiting split to 5 items like so: `@record = split( /\s,\s/o, $_, 5); $field4 = $record[3];` [download] I don't know if split will speed up from this, but it's worth a try. --Shoeboy `perl -e "do {kill $java, $ada, $cobol, $pascal, $csh;} until die 'Just another perl hacker';"`	[reply] [d/l] [select]
RE: Re: Regexp glitch while parsing record in CSV by greenhorn (Sexton) on Jul 18, 2000 at 00:50 UTC
I have found this collection of replies downright fascinating. Bloody regular expressions, anyway. :) Given some of the regexp pitfalls that have become obvious as I read through the replies, I find myself thinking that--TMTOWTDI notwithstanding--in this case the "correct" approach might well be the one you've suggested: <kbd>@record = split( /\s,\s/o, $_, 5); $field4 = $record[3];</kbd> Judging by q&d benchmarking I did, restricting the split to 5 fields did speed things up a bit. So this approach, while not as sexy as the ones using the regular expressions: 1) gets the job done; 2) isn't breathtakingly fast--but isn't dismally slow, either; 3) is immediately understandable; a complex regexp might lead, later on, to some head-scratching...; 4) ensures, in a simple way, that the leading and trailing spaces are removed. Once more, thanks to all of you for the extensive feedback.	[reply]
Re: Regexp glitch while parsing record in CSV by lhoward (Vicar) on Jul 17, 2000 at 04:29 UTC
This is slightly off-topic, but have you considered using Text::CSV or DBD::CSV for parsing the file? Both of those modules are really convenient to use.	[reply]
RE: Re: Regexp glitch while parsing record in CSV by greenhorn (Sexton) on Jul 17, 2000 at 10:14 UTC
Concerning the modules on CPAN: I had a look at the links you provided (thanks for that). As often happens, I find that I don't grasp the reference material--I don't get the syntax. I will have to spend more time with sundry Perl books I have here to see if I can learn how such modules work. Adding to my confusion: the links within the page to which you referred me appear to point to .PM files, but they download as HTML files, not as .PM files. Obviously I'll need to work out how to "use" CPAN as well. (Aside from all this, I'm quite knowledgeable. <g>)	[reply]
(jjhorner)Regexp glitch while parsing record in CSV by jjhorner (Hermit) on Jul 17, 2000 at 06:35 UTC
I'm not getting the same results as everyone else: #!/usr/bin/perl -w use strict; use Benchmark; my $string = " field1 , field2 , field3 , field4 , field5 , field6 , field7 , field8 , field9 , field10 , field11 , field12 "; sub regexp { $string =~ m#^(\s)(\w?)(\s), (\s)(\w?)(\s),(\s)(\w?)(\s), (\s)(\w?)(\s),(\s)(\w?)(\s), (\s)(\w?)(\s),(\s)(\w?)(\s), (\s)(\w?)(\s),(\s)(\w?)(\s), (\s)(\w?)(\s),(\s)(\w?)(\s), (\s)(\w?)(\s)$#o; my $s = $11; print $s; } sub splitter { my $s = (split(/\s,\s/, $string))[3]; print $s; } [download] Yields: `[22:31:17] Orders, Master? ./greenhorn-0716.pl Benchmark: timing 100000 iterations of regexpression, splitting... regexpression: 12 wallclock secs (12.78 usr + 0.00 sys = 12.78 CPU) @ + 7824.73/s (n=100000) splitting: 8 wallclock secs ( 7.90 usr + 0.00 sys = 7.90 CPU) @ 12 +658.23/s (n=100000)` [download] Is there something wrong with my code? I know that isn't the best regular expression ever created, but I would like to know why it doesn't beat the split. Note: To whomever I saw with the shell prompt like I have above, that is so cool! J. J. Horner Linux, Perl, Apache, Stronghold, Unix jhorner@knoxlug.org http://www.knoxlug.org/	[reply] [d/l] [select]
(chromatic) RE: (jjhorner)Regexp glitch while parsing record in CSV by chromatic (Archbishop) on Jul 17, 2000 at 07:57 UTC
Taking out the capturing parenthesis around the whitespace helped a lot. Here's the benchmark results: `Benchmark: timing 50000 iterations of regex, split... regex: 3 wallclock secs ( 3.03 usr + 0.01 sys = 3.04 CPU) split: 4 wallclock secs ( 3.79 usr + 0.00 sys = 3.79 CPU)` [download]	[reply] [d/l]
(Ovid) RE: Regexp glitch while parsing record in CSV by Ovid (Cardinal) on Jul 17, 2000 at 21:30 UTC
Couple of comments: Reread lhoward's post. Those two modules might save you a lot of headaches. Your regex, for example, will not allow for embedded or escaped commas (assuming you ever get any). `@record = split( /\s,\s/o, $_);` is better written as `@record = split /\s,\s/;`. I'm not commenting on the meat of your node as you already have tons of good comments about that. The /o modifier is not necessary if an interpolated variable is not in the regex. Perl automatically compiles the regex only once if it's a straight regex with no variable interpolation (I've made this mistake in posts here, too. See Mastering Regular Expressions, second edition, p159). The /o modifier can be useful if you have a regex like `/\s+${someword}ing/o`. If that variable's value never changes, then using /o to force the regex compiler to only compile the expression once is good. However, if you use it to compile an expression with a variable whose value may change, then you're in trouble. The variable's value in the regex will be stuck in the first value and never get updated. The following code illustrates the danger: `while (<SOMEFILE>) { chomp; $found = 1 if $user =~ /$_/o; }` [download] If $user isn't the first one in SOMEFILE, then $user will never match. Incidentally, you can get around this limitation with anonymous subroutines. merlyn has an excellent article about this. This node was brought to you by the letter "O".	[reply] [d/l]
Re: Regexp glitch while parsing record in CSV by Anonymous Monk on Jul 18, 2000 at 00:28 UTC
Just a quick note to add to Ovid's fine comments. You can use `([^,]+?)` to have the best of both worlds (robust and specific as well as not being so greedy as to suck up the trailing white space before the `\s` has a chance to get it. You can also get fancier and help the regex engine match quicker by doing something like `/^\s([^,][^,\s])\s,/`, which (I think) will reduce the amount of backtracking possible (is it called "forwardtracking" in the case of non-greedy regex elements? ;> ). Note that this, like the last ones, doesn't match empty values. You can support empty values with `/^\s((?:[^,][^,\s])?)\s*,/`, but then you are going a long way for what may be little benefit.	[reply] [d/l] [select]
RE: Regexp glitch while parsing record in CSV by mrmick (Curate) on Jul 17, 2000 at 15:38 UTC
If you are only concerned about the leading and trailing whitespace in the 4th field, I suggest the following (or something similar): `my @record = split( /,/); my $field4 = $record[3]; $field4 =~ /^\s//; $field4 =~ /\s$//;` [download] I know that this isn't exactly a sophisticated regex but we are only working with one field. This is still faster than performing the replacement on all fields, I'm sure. Mick	[reply] [d/l]