Re: Regexp glitch while parsing record in CSV
by turnstep (Parson) on Jul 17, 2000 at 04:39 UTC
|
Looks to me like your regexp is saying:
... \s*([^,]+)\s*,...
Find some whitespace, or perhaps none at all.
Find anything that is not a comma, at least once,
which *includes* whitespace, then...
Find some more whitespace, perhaps none
Find a comma
Since perl is greedy, it takes the whitespace into
the parens, and leaves nothing for the \s* afterwards to
match.
I came up with this:
m#(?:[^,]*,\s*){3}(.*?)\s*,#
which seems to do the trick. Breaking it down:
(?: ## Group, but do not store it into $1
[^,]* ## Anything that is not a comma
, ## Followed by a comma
\s* ## Followed by possible whitespace
){3} ## Find three of these (the first three)
(.*?) ## Match any character, but don't be so greedy about it
\s* ## Possible whitespace
, ## Stops at first comma, because we are not being greedy
There is probably a better way to write it, but
I'm tired and this seems to work.
| [reply] [d/l] [select] |
|
|
<kbd>> m#(?:[^,]*,\s*){3}(.*?)\s*,#</kbd>
This worked, thanks--and taught me a couple of things about regular expressions that I'd seen in several of the books but hadn't yet understood.
Still perplexed by the failure of the other regexp. Even
though<kbd> [^,]+ </kbd>does indeed include spaces, I'm perplexed by why
the<kbd> \s* </kbd>preceding and following it fail to catch spaces
when they exist in those locations in the string.<kbd> \s* </kbd>does
catch them when it is used this way:
<kbd>split /\s*,\s*/ , $_</kbd>
Thanks again.
| [reply] |
|
|
The asterisk can be a little tricky. It means, match zero or more of the preceding. In a construct like \s*[^,]+\s*, the character class does match the spaces and the comma. The + is greedy, too.
| [reply] [d/l] |
|
|
I have two problems with the above regex. This first is that everything except the commas is optional.
#!/usr/bin/perl
$_ = ",,,,";
print "Good\n" if /(?:[^,]*,\s*){3}(.*?)\s*,/;
The above code will print "Good\n". If you've worked with CSV data for any length of time you know that the odds are good that sooner or later you'll get a line with all commas (at least I have). Depending upon what is done with the data, you could have serious data corruption.
Also, I would try to avoid the (.*?)\s*, construct. It's not terribly specific and can cause problems. ([^,]+)\s*, is very specific and is more appropriate. In fact, if you know that the data you are capturing won't have any embedded spaces or tabs (and I'm assuming that everything is on one line), you can use ([^, \t]+),.
In this case, I don't feel that it will cause a problem with how your regex is crafted, but subtle errors can creep in down the road as maintenance occurs. Your regex is fine because the whitespace behind it is optional, but the negated character class is almost always preferable because it states exactly what you want.
Consider the following problem: you want to print the first field of comma-delimited text if the last character prior to the comma is a sharp (#), but you don't want to capture the sharp. If the data doesn't fit this format, you want the regex to fail completely. The following regex looks fine at first glance:
print "$1\n" if /^(.+?)#,/;
It is, however, a bad choice. The negated character class is proper:
#!/usr/bin/perl
$_ = "test1, test2#,test3";
print "$1\n" if /^(.+?)#,/; # Returns a false positive
print "$1\n" if /^([^,]+)#,/; # This fails, as we expect
The first regex above will print test1, test2. I'm not trying to sound picky, but any time I see the .* or .+ used in a regex, I always look for a way to remove it because it's not terribly precise. | [reply] [d/l] [select] |
Re: Regexp glitch while parsing record in CSV
by Shoeboy (Sexton) on Jul 17, 2000 at 07:37 UTC
|
Have you considered optimizing split a little bit.
Your regex finds 4 commas and returns whatever's between the 3rd and 4th. Your split finds 11 commas and returns a list of 12 items. It's no suprise that the regex is faster. If you're only trying to return a part of a string, a regex seems to be more appropriate than split. Still, if you like the idea of using split (it is shorter and more readable) -- try limiting split to 5 items like so:
@record = split( /\s*,\s*/o, $_, 5);
$field4 = $record[3];
I don't know if split will speed up from this, but it's worth a try.
--Shoeboy
perl -e "do {kill $java, $ada, $cobol, $pascal, $csh;} until die 'Just another perl hacker';" | [reply] [d/l] [select] |
|
|
I have found this collection of replies downright fascinating. Bloody regular expressions, anyway. :)
Given some of the regexp pitfalls that have become obvious as I read through the replies, I find myself thinking that--TMTOWTDI notwithstanding--in this case the "correct" approach might well be the one you've suggested:
<kbd>@record = split( /\s*,\s*/o, $_, 5);
$field4 = $record[3];</kbd>
Judging by q&d benchmarking I did, restricting the split to 5 fields did speed things up a bit. So this approach, while not as sexy as the ones using the regular expressions:
1) gets the job done;
2) isn't breathtakingly fast--but isn't dismally slow, either;
3) is immediately understandable; a complex regexp might lead, later on, to some head-scratching...;
4) ensures, in a simple way, that the leading and trailing spaces are removed. Once more, thanks to all of you for the extensive feedback.
| [reply] |
Re: Regexp glitch while parsing record in CSV
by lhoward (Vicar) on Jul 17, 2000 at 04:29 UTC
|
This is slightly off-topic, but
have you considered using Text::CSV or DBD::CSV
for parsing the file? Both of those modules
are really convenient to use. | [reply] |
|
|
Concerning the modules on CPAN: I had a look at the links you provided (thanks for that). As often happens, I find that I don't grasp the reference material--I don't get the syntax. I will have to spend more time with sundry Perl books I have here to see if I can learn how such modules work.
Adding to my confusion: the links within the page to which you referred me appear to point to .PM files, but they download as HTML files, not as .PM files. Obviously I'll need to work out how to "use" CPAN as well. (Aside from all this, I'm quite knowledgeable. <g>)
| [reply] |
(jjhorner)Regexp glitch while parsing record in CSV
by jjhorner (Hermit) on Jul 17, 2000 at 06:35 UTC
|
#!/usr/bin/perl -w
use strict;
use Benchmark;
my $string = " field1 , field2 , field3 , field4 ,
field5 , field6 , field7 , field8 , field9 ,
field10 , field11 , field12 ";
sub regexp {
$string =~ m#^(\s*)(\w*?)(\s*),
(\s*)(\w*?)(\s*),(\s*)(\w*?)(\s*),
(\s*)(\w*?)(\s*),(\s*)(\w*?)(\s*),
(\s*)(\w*?)(\s*),(\s*)(\w*?)(\s*),
(\s*)(\w*?)(\s*),(\s*)(\w*?)(\s*),
(\s*)(\w*?)(\s*),(\s*)(\w*?)(\s*),
(\s*)(\w*?)(\s*)$#o;
my $s = $11;
print $s;
}
sub splitter {
my $s = (split(/\s*,\s*/, $string))[3];
print $s;
}
Yields:
[22:31:17] Orders, Master? ./greenhorn-0716.pl
Benchmark: timing 100000 iterations of regexpression, splitting...
regexpression: 12 wallclock secs (12.78 usr + 0.00 sys = 12.78 CPU) @
+ 7824.73/s (n=100000)
splitting: 8 wallclock secs ( 7.90 usr + 0.00 sys = 7.90 CPU) @ 12
+658.23/s (n=100000)
Is there something wrong with my code? I know that isn't the
best regular expression ever created, but I would like to
know why it doesn't beat the split.
Note: To whomever I saw with the shell prompt like
I have above, that is so cool!
J. J. Horner
Linux, Perl, Apache, Stronghold, Unix
jhorner@knoxlug.org http://www.knoxlug.org/
| [reply] [d/l] [select] |
|
|
Taking out the capturing parenthesis around the whitespace helped a lot. Here's the benchmark results:
Benchmark: timing 50000 iterations of regex, split...
regex: 3 wallclock secs ( 3.03 usr + 0.01 sys = 3.04 CPU)
split: 4 wallclock secs ( 3.79 usr + 0.00 sys = 3.79 CPU)
| [reply] [d/l] |
(Ovid) RE: Regexp glitch while parsing record in CSV
by Ovid (Cardinal) on Jul 17, 2000 at 21:30 UTC
|
Couple of comments:
- Reread lhoward's post. Those two modules might save you a lot of headaches. Your regex, for example, will not allow for embedded or escaped commas (assuming you ever get any).
- @record = split( /\s*,\s*/o, $_); is better written as @record = split /\s*,\s*/;.
- I'm not commenting on the meat of your node as you already have tons of good comments about that.
The /o modifier is not necessary if an interpolated variable is not in the regex. Perl automatically compiles the regex only once if it's a straight regex with no variable interpolation (I've made this mistake in posts here, too. See Mastering Regular Expressions, second edition, p159).
The /o modifier can be useful if you have a regex like /\s+${someword}ing/o. If that variable's value never changes, then using /o to force the regex compiler to only compile the expression once is good. However, if you use it to compile an expression with a variable whose value may change, then you're in trouble. The variable's value in the regex will be stuck in the first value and never get updated. The following code illustrates the danger:
while (<SOMEFILE>) {
chomp;
$found = 1 if $user =~ /$_/o;
}
If $user isn't the first one in SOMEFILE, then $user will never match. Incidentally, you can get around this limitation with anonymous subroutines. merlyn has an excellent article about this.
This node was brought to you by the letter "O". | [reply] [d/l] |
Re: Regexp glitch while parsing record in CSV
by Anonymous Monk on Jul 18, 2000 at 00:28 UTC
|
Just a quick note to add to Ovid's fine comments.
You can use ([^,]+?) to have the best
of both worlds (robust and specific as well as not being
so greedy as to suck up the trailing white space before
the \s* has a chance to get it.
You can also get fancier and help the regex engine match
quicker by doing something like
/^\s*([^,]*[^,\s])\s*,/, which (I think) will
reduce the amount of backtracking possible (is it called
"forwardtracking" in the case of non-greedy regex
elements? ;> ). Note that this, like the last ones,
doesn't match empty values.
You can support empty values with
/^\s*((?:[^,]*[^,\s])?)\s*,/, but then you are
going a long way for what may be little benefit.
| [reply] [d/l] [select] |
RE: Regexp glitch while parsing record in CSV
by mrmick (Curate) on Jul 17, 2000 at 15:38 UTC
|
If you are only concerned about the leading and trailing whitespace in the 4th field, I suggest the following (or something similar):
my @record = split( /,/);
my $field4 = $record[3];
$field4 =~ /^\s*//;
$field4 =~ /\s*$//;
I know that this isn't exactly a sophisticated regex but we are only working with one field. This is still faster than performing the replacement on all fields, I'm sure.
Mick | [reply] [d/l] |