Re: Perl is returning... odd results... from regular expressions. Things matching when they shouldn't, and stuff like that.
by ikegami (Patriarch) on Jan 11, 2007 at 05:08 UTC
There's no comma leading the data, but it matches anyway... how?
What makes you think it matched? In the code you presented, m/^,/ is evaluated, then its value is discarded, then $i is evaluated, and its value is used to determine whether to enter the if or not. Since $i is true, the if is entered.
Were you trying to do the following?
my $j=0;
if ($i =~ m/^,/){
    print $j++ . ": " . $i . "\n";
}
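To make the difference concrete, here is a small sketch. The original condition is not quoted in this thread, so the comma form below is only an assumed reconstruction, and the sub names comma_condition and match_condition are made up for the example.

```perl
use strict;
use warnings;

# Assumed reconstruction of the problem: a comma operator inside the
# condition. The match runs, its result is thrown away, and only $i
# decides the outcome.
sub comma_condition {
    my ($i) = @_;
    no warnings 'void';    # silence any complaint about the discarded match
    my $entered = ( $i =~ m/^,/, $i ) ? 1 : 0;
    return $entered;
}

# Binding the match to $i makes the regex result the condition.
sub match_condition {
    my ($i) = @_;
    my $entered = ( $i =~ m/^,/ ) ? 1 : 0;
    return $entered;
}

print comma_condition("Bag10x8x24"),  "\n";    # true: $i is a non-empty string
print match_condition("Bag10x8x24"),  "\n";    # false: no leading comma
print match_condition(",Bag10x8x24"), "\n";    # true: leading comma
```

The first sub reports true for any true value of $i, whether or not the pattern matches, which fits the symptom described.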
m// is the same as $_ =~ m//
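A quick sketch of that equivalence, with made-up sample strings: inside the loop $_ holds each string in turn, so the bare match and the explicit binding give the same answer.

```perl
use strict;
use warnings;

my @hits;
for ( ",leading comma", "no leading comma" ) {
    # $_ is aliased to each string in turn.
    my $implicit = m/^,/       ? 1 : 0;    # bare match, tests $_
    my $explicit = $_ =~ m/^,/ ? 1 : 0;    # same test, target spelled out
    push @hits, [ $implicit, $explicit ];
}

print "implicit=$_->[0] explicit=$_->[1]\n" for @hits;
```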
Re: Perl is returning... odd results... from regular expressions. Things matching when they shouldn't, and stuff like that.
by merlyn (Sage) on Jan 11, 2007 at 05:43 UTC
split /(\".*?\"(?=,))|(.*?(?=,))|(.*?(?=\n))/
Yeeech. I guess when all you have in your toolbox is a hammer, everything gets a little banged up, regardless.
Had you considered using @result = /pattern/g instead of split?
What I've found is that in general, if it's easier to talk about what you're keeping than what you're throwing away, the match wins over the split.
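As a rough illustration of that rule of thumb (the sample string is invented, not taken from the original data): when the interesting parts are easy to describe, a list-context match with /g collects them directly, while split has to describe everything else.

```perl
use strict;
use warnings;

my $line = "width=10, height=8, depth=24";

# Describe what to keep: every run of digits.
my @kept = $line =~ /(\d+)/g;

# Describe what to throw away: every run of non-digits. The text before
# the first number counts as a separator, so split returns a leading
# empty field.
my @split = split /\D+/, $line;

print "match: @kept\n";
print "split has ", scalar(@split), " elements\n";
```

Here the match yields just the three numbers, while the split version needs extra cleanup for the empty first element.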
Re: Perl is returning... odd results... from regular expressions. Things matching when they shouldn't, and stuff like that.
by SheridanCat (Pilgrim) on Jan 11, 2007 at 05:04 UTC
When embarking on something in Perl that seems simple but turns out to be complicated - which parsing CSV is - you should start looking around CPAN. Someone has very often already tackled your problem and provided a nice module to help others.
In your situation, take a look at Text::CSV::Simple or another CSV module. Trying to roll your own regex for this type of thing can be an exercise in frustration when you're really mostly interested in getting the job done.
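For a feel of what the module route looks like, here is a minimal sketch using Text::CSV (one of the "other CSV modules", not Text::CSV::Simple itself). It assumes Text::CSV is installed from CPAN, and the sample record is made up.

```perl
use strict;
use warnings;

# Assumption: Text::CSV is installed from CPAN; it is not a core module.
my @fields;
if ( eval { require Text::CSV; 1 } ) {
    my $csv  = Text::CSV->new( { binary => 1 } );
    my $line = '123,456,"hello, world",,789';    # made-up sample record
    if ( $csv->parse($line) ) {
        @fields = $csv->fields;    # quoted commas stay inside one field
    }
    else {
        warn "could not parse line: " . $csv->error_diag . "\n";
    }
    print scalar(@fields), " fields\n";
}
else {
    print "Text::CSV is not installed\n";
}
```

The empty field between the two adjacent commas comes back as an empty string, so blank columns keep their position.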
Actually, if you want someone else to be able to modify it in your absence, you'll want to use the module even more. The module's interface is way better than trying to read the code that actually does the work.
Really.
I mean it.
In fact, if you really want to make your life easy, you may want to use DBD::CSV, where you can use a bit of SQL to insert the results of a SELECT on the old CSV table into a new CSV table. A lot of magic will happen under the covers, but it's magic that you don't need to write, maintain, comment, or play with. Same goes for your friend ;-) What you're left with is some really easy-to-tweak code that your friend should have much less trouble playing with.
Re: Perl is returning... odd results... from regular expressions. Things matching when they shouldn't, and stuff like that.
by shigetsu (Hermit) on Jan 11, 2007 at 05:27 UTC
First off, may we see an excerpt of the relevant records?
Is it a CSV file or database record? It obviously can't be both.
split /(\".*?\"(?=,))|(.*?(?=,))|(.*?(?=\n))/
Using $1, which results from your capturing parentheses, in combination with split seems weird.
Think about it: split excludes the chunks matched by the pattern from the resulting list, yet you are capturing the values that match the pattern (\".*?\"(?=,))|(.*?(?=,))|(.*?(?=\n)) itself.
The lookaheads are okay though.
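To show what capturing parentheses do inside a split pattern, here is a small sketch with a made-up input string. It is only an illustration of split's documented behaviour, not a fix for the pattern above.

```perl
use strict;
use warnings;

my @plain    = split /,/,       "1-10,20";    # separators are dropped
my @captured = split /(-)|(,)/, "1-10,20";    # captured separators are returned too

# Each separator contributes one slot per capturing group; a group that
# did not take part in that match shows up as undef.
print join( " | ", map { defined $_ ? "'$_'" : "undef" } @captured ), "\n";
```

With three capturing groups, as in the pattern quoted above, every separator adds three slots to the result, most of them undef.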
Bag10x8x24,Poly Bags 10x8x24 gussetted,1,FALSE,FALSE,,"Poly Bags 10x8x24 gussetted metallocene bags; Assoc. Bag # 264-4-64 (500/carton, 1 carton min)",0.00,NC,0,0.00,0.00,NC,0,0.00,0.00,NC
Bag2.5x3zip,2.5x3x.004 zip lock bag,1,FALSE,FALSE,,"2-1/2 x 3x .004" zip lock bags with hang hole. Assoc Bag item #274-03H",0.00,NC,0,0.00,0.00,NC,0,0.00,0.00,PL1*0.6700000,0,0.00,0.00
H06045-fullthd,"M6 x 45 hex cap scrw, full thd",1,FALSE,FALSE,"M6 x 45 hex cap scrw, full thd, class 8.8, zinc (C)","M6 x 45 hex cap scrw, full thd, class 8.8, zinc, Bossard article # 1049577",0.16,NC,0,0.00,0.00,NC,0,0.00,0.10,PL1*0.6300000,0,0.00,0.10
There can effectively be any number of quotes or commas inside a quote-delimited field (though I'm not sure what the export does if a quote mark is followed by a comma in a description field... it hasn't happened before though, and it's not really a concern as it's easily enough avoided), and there can effectively be any number of quoted fields per line. There are also many blank (no data at all) fields, ALL of which have to be tracked and accounted for.
As to the split, I noticed while reading through my Perl book that, when given capturing parentheses, split returns the matched separators (normally discarded) as well as the remaining data (normally retained). If nothing else, I figured it'd be handy for double-checking my regular expressions, as I could see what it dropped too.
Thanks for the reply!
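On the point about blank fields having to be accounted for: one thing worth knowing with any split-based approach is that split drops trailing empty fields unless it is given a negative limit. A minimal sketch, using a made-up row that ends in blank fields:

```perl
use strict;
use warnings;

my $row = "NC,0,,0.00,,";    # made-up row: blank fields in the middle and at the end

my @default  = split /,/, $row;        # trailing empty fields are removed
my @keep_all = split /,/, $row, -1;    # negative limit keeps them

print scalar(@default),  " fields without a limit\n";
print scalar(@keep_all), " fields with a limit of -1\n";
```

The blank field in the middle survives either way; only the trailing ones are affected.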
Re: Perl is returning... odd results... from regular expressions. Things matching when they shouldn't, and stuff like that.
by bart (Canon) on Jan 11, 2007 at 13:00 UTC
split /(\".*?\"(?=,))|(.*?(?=,))|(.*?(?=\n))/
/(?=,)/ is a lookahead: it looks at the comma, but it doesn't include it in the match. That's why it's not at the end of the matched strings. Nowhere do you get rid of the commas, so they still appear at the start of the next match, which is a problem: your doublequoted strings will never be recognized after the first match, because of this comma!
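A small sketch of that effect, on a made-up two-field string: after a match that ends in a lookahead the comma is still waiting at the front of the remaining text, whereas a pattern that matches the comma consumes it.

```perl
use strict;
use warnings;

my $s = "item1,field2";

# Lookahead: the comma is checked but not consumed.
$s =~ /^(.*?)(?=,)/ or die "no match";
my $rest_lookahead = substr $s, $+[0];    # $+[0] is where the whole match ended

# Plain comma: the comma is part of the match.
$s =~ /^(.*?),/ or die "no match";
my $rest_consumed = substr $s, $+[0];

print "after lookahead: '$rest_lookahead'\n";
print "after consuming: '$rest_consumed'\n";
```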
Also, I'm not convinced your use of split is the best approach. Why not use //g?
$_ = qq(item1,field2,more data,"a quoted, comma containing string"\n);
@data = /(\".*?\"|.*?)[\n,]/g;
$j = 0; printf "%d: %s\n", $j++, $_ for @data;
Result:
0: item1
1: field2
2: more data
3: "a quoted, comma containing string"
p.s. I didn't use this, but you're probably better off testing with defined than with a truth value to weed out unused captures.
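A hedged sketch of why defined is the safer test, using a made-up string with a "0" field (the sample data has plenty of those): a truth test throws away the legitimate "0" along with the unused captures.

```perl
use strict;
use warnings;

# With two alternative capture groups, the group that did not match
# contributes undef to the list.
my @pieces = split /(-)|(,)/, "0-10,20";

my @by_truth   = grep { $_ } @pieces;          # drops undef, but also the field "0"
my @by_defined = grep { defined } @pieces;     # drops only the unused captures

print scalar(@by_truth),   " items survive the truth test\n";
print scalar(@by_defined), " items survive the defined test\n";
```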
Re: Perl is returning... odd results... from regular expressions. Things matching when they shouldn't, and stuff like that.
by Sagacity (Monk) on Jan 11, 2007 at 06:06 UTC
First off, I agree with the other posts!
If you are looking for a pattern that has first a quotation mark,
then any characters up to another quotation mark, an = sign, and finally
a comma;
then a second pattern that is the same as the first without the quotation marks;
and finally, any characters, an = sign, and the newline character, then try the
rewrite and see if you get better results and tweak it from there.
The * and the ? next to each other are redundant, especially after the
wildcard . (which means any character), with the * meaning 0 or more of them.
split /(\".*?\"(?=,))|(.*?(?=,))|(.*?(?=\n))/
I think you are looking for something more like this.
The second pattern and the first end up being redundant, so I removed the
first pattern. Please note that it has been a long time since I have worked on
this type of pattern matching, and I may have completely missed the mark.
$some_value = split (/.*\=,|.*\=\n$/, $some_scalar);
Actually, the "?" is useful. It makes it non-greedy. As this CSV file can have multiple strings, if it isn't included, it returns EVERYTHING between the first and last quote mark.
As to the other question marks, like the (?=,) portion, those are lookaheads.
I appreciate the reply, though! Thanks!
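For anyone following along, a small sketch of the greedy versus non-greedy difference described above, on a made-up record with two quoted fields:

```perl
use strict;
use warnings;

my $rec = 'x,"first, one",y,"second, one",z';    # made-up record

my ($greedy) = $rec =~ /"(.*)"/;     # runs from the first quote to the last one
my ($lazy)   = $rec =~ /"(.*?)"/;    # stops at the nearest closing quote

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```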
Re: Perl is returning... odd results... from regular expressions. Things matching when they shouldn't, and stuff like that.
by Melly (Chaplain) on Jan 11, 2007 at 10:27 UTC
Well, if you don't want to use a module, you could try something like the following code - it basically breaks the job down into several parts. The only major requirement is that all your quotes should be valid pairs (you should probably add a test to check that you have an even number of quotes and that you have the number of fields per line that you expect).
- Pull out the quoted sections
- Replace ',' with '_comma_' in the quoted sections
- Restore the quoted sections back into position
- Safely split on ',' (since quoted commas are now '_comma_')
- Replace '_comma_' with ','
Here's the code:
use strict;
use warnings;
my @output;
while(<DATA>){
    chomp;
    next unless /\S/;
    # collect the contents of any quoted sections (the capture leaves out the
    # quotes themselves; we assume that all quotes are paired)
    my @quoted = /"([^"]*)"/g;
    # replace any commas in the array with '_comma_'
    foreach my $quote (@quoted){
        $quote =~ s/,/_comma_/g;
    }
    # now swap each quoted section for its '_comma_' version, re-adding the quotes
    s/"[^"]*"/'"' . (shift @quoted) . '"'/ge;
    # now we can safely split on any commas (quoted commas are now '_comma_')
    push @output, [split /,/];
    # finally, replace any '_comma_' values with ',' in the latest element of @output
    foreach (@{$output[-1]}){
        s/_comma_/,/g;
    }
}
# what have we got?
foreach (@output){
    foreach (@{$_}){
        print "$_:";
    }
    print "\n";
}
__DATA__
123,456,"hello, world, goodbye, world",789
123,456,"hello, world, goodbye, world",789,"foo, bar","bar, foo"
"hello, world","goodbye, world",123,"foo"
"hello" 123,456,"goodbye, world",789
map{$a=1-$_/10;map{$d=$a;$e=$b=$_/20-2;map{($d,$e)=(2*$d*$e+$a,$e**2
-$d**2+$b);$c=$d**2+$e**2>4?$d=8:_}1..50;print$c}0..59;print$/}0..20
Tom Melly, pm@tomandlu.co.uk
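The quote-pairing check suggested above could be sketched roughly as follows. The helper name quotes_balanced is made up, and the second sample line is a shortened version of the zip lock bag record posted earlier in the thread, where an inch mark leaves the line with an odd number of quotes.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the line has an even number of double quotes.
sub quotes_balanced {
    my ($line) = @_;
    my $count = ( $line =~ tr/"// );    # tr returns how many characters it matched
    return $count % 2 == 0 ? 1 : 0;
}

my @lines = (
    '123,456,"hello, world",789',
    'Bag2.5x3zip,"2-1/2 x 3x .004" zip lock bags",0.00',    # shortened from the sample data
);

for my $line (@lines) {
    warn "odd number of quotes, skipping: $line\n" unless quotes_balanced($line);
}
```

A parity check like this only flags suspicious lines; it cannot tell which quote is the stray one.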
Re: Perl is returning... odd results... from regular expressions. Things matching when they shouldn't, and stuff like that.
by Anonymous Monk on Jan 11, 2007 at 08:32 UTC
Perl is returning... odd results... I'm learning perl,
:) Makes me wonder what Dominus might say (:
Of course it doesn't work! That's because you don't know what you are doing!
Ah yes, and you are the first person to have noticed this bug since 1987. Sure.
Yes, that's what it's supposed to do when you say that.
Well, what did you expect?
The bug is in you, not in Perl.
So you threw in some random punctuation for no particular reason, and then you didn't get the result you expected. Hmmmm.