Groxx has asked for the wisdom of the Perl Monks concerning the following question:

I'm learning Perl, and am trying to use it to convert a CSV database so I can upload the modified database to another location. The wonderful thing is that the CSV one program puts out can have quoted strings, which can contain commas and quotation marks that are unmarked/unescaped in any way. By fiddling around, and trying the regular expression in other applications that have regexp support, I think I've figured out an expression that works by matching the data I need, not the commas. (I couldn't figure out how to make more complicated, i.e. full regular expression based, lookbehinds. Is this possible?)

split /(\".*?\"(?=,))|(.*?(?=,))|(.*?(?=\n))/

Parenthesis pairs for preserving data. This also gives me the fun part of blank results, which I take care of below, and with an is-defined check before doing anything with the data.

Now comes the fun part: logically speaking, I can't find anything wrong with it, and every application I've tested it in has worked perfectly. When I slap it into a Perl program, it doesn't. I can post my whole program and a testable part of the database if it's needed.

Then, in the same program (and working off the split string from the expression), I have a statement like this:

my $j=0; if (m/^,/, $i){ print $j++ . ": " . $i . "\n"; }

It helps me weed out blank and non-relevant results. One problem comes up. The remaining results look kinda like this:

0: item1
1: ,field2
2: ,more data

etc etc. The first result is what's getting my attention. There's no comma leading the data, but it matches anyway... how? And where did the commas come in in the first place? They're not part of the initial expression's results...

Any help? I've fiddled with this thing for a few hours, and each time the results of the regular expression fail to make any sense, matching data that doesn't match, and not separating data correctly. The most blatant example would be the above example, where m/^,/ matches a string that doesn't have a comma at the beginning.

If it matters, perl -v in my terminal returns this (and yes, I'm running OSX):

This is perl, v5.8.6 built for darwin-thread-multi-2level

(with 3 registered patches, see perl -V for more detail)

Copyright 1987-2004, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on this system using `man perl' or `perldoc perl'. If you have access to the Internet, point your browser at http://www.perl.org/, the Perl Home Page.

Thanks!


Replies are listed 'Best First'.
Re: Perl is returning... odd results... from regular expressions. Things matching when they shouldn't, and stuff like that.
by ikegami (Patriarch) on Jan 11, 2007 at 05:08 UTC

    There's no comma leading the data, but it matches anyway... how?

    What makes you think it matched? In the code you presented, m/^,/ is evaluated, then its value is discarded, then $i is evaluated, and its value is used to determine whether to enter the if or not. Since $i is true, the if is entered.

    Were you trying to do the following?

    my $j=0; if ($i =~ m/^,/){ print $j++ . ": " . $i . "\n"; }
      I thought I saw somewhere that m//,STRING worked... Guess not, my mistake ^^;

      Thanks! That's part of the problem, at least. Very much appreciated!

        m// is the same as $_ =~ m//
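
The difference between the two forms discussed above can be checked with a small sketch. The sample value `'item1'` and the variable names `$comma_form` and `$bound_form` are assumptions made up for illustration; they are not from the original program.

```perl
use strict;
use warnings;

# Hypothetical sample value: no leading comma, but a true string.
my $i = 'item1';
local $_ = 'unrelated';   # a bare m// is matched against $_, not $i

# Comma operator: the match result is thrown away, the truth of $i decides.
my $comma_form = (m/^,/, $i) ? 1 : 0;

# Binding operator: the match is applied to $i itself.
my $bound_form = ($i =~ m/^,/) ? 1 : 0;

print "comma form: $comma_form\n";   # 1, although nothing matched
print "bound form: $bound_form\n";   # 0, $i has no leading comma
```

The comma form enters the branch for any true `$i`, which is consistent with the `0: item1` line in the question.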
Re: Perl is returning... odd results... from regular expressions. Things matching when they shouldn't, and stuff like that.
by merlyn (Sage) on Jan 11, 2007 at 05:43 UTC
    split /(\".*?\"(?=,))|(.*?(?=,))|(.*?(?=\n))/
    Yeeech. I guess when all you have in your toolbox is a hammer, everything gets a little banged up, regardless.

    Had you considered using @result = /pattern/g instead of split? What I've found is that in general, if it's easier to talk about what you're keeping than what you're throwing away, the match wins over the split.

      I'll definitely give that a try. I tend to bite off more than I can chew at times, and this is my first real Perl project, so I wasn't quite sure what to do.

      Thanks! This should help clean up some of the results, to say the least :)

Re: Perl is returning... odd results... from regular expressions. Things matching when they shouldn't, and stuff like that.
by SheridanCat (Pilgrim) on Jan 11, 2007 at 05:04 UTC
    When embarking on something in Perl that seems simple but turns out to be complicated - which parsing CSV is - you should start looking around CPAN. Someone has very often already tackled your problem and provided a nice module to help others.

    In your situation, take a look at Text::CSV::Simple or another CSV module. Trying to roll your own regex for this type of thing can be an exercise in frustration when you're really mostly interested in getting the job done.

      I'll keep that module in mind, though I actually DO want to do it myself if possible. I'm also making this for someone else, so I want him to be able to modify it in my absence if needed (so it'll be commented like mad).

      Thanks for the link! Again, I'm new at this, so I probably wouldn't have found a good one on my first try.

        Actually, if you want someone else to be able to modify it in your absence, you'll even more want to use the module. The module's interface is way better than trying to read the code that actually does the work.

        Really.

        I mean it.

        In fact, if you want to really make your life easy, you may want to use DBD::CSV where you can just use a bit of SQL to insert into a new CSV table some sort of SELECT from the old CSV table. A lot of magic will happen under the covers, but it's magic that you don't need to write, maintain, comment, or play with. Same goes for your friend ;-) What you're left with is some really easy-to-tweak code that your friend should have much less problem playing with.

Re: Perl is returning... odd results... from regular expressions. Things matching when they shouldn't, and stuff like that.
by shigetsu (Hermit) on Jan 11, 2007 at 05:27 UTC

    First off, may we see an excerpt of the relevant records?
    Is it a CSV file or database record? It obviously can't be both.

    split /(\".*?\"(?=,))|(.*?(?=,))|(.*?(?=\n))/

    Using capturing parentheses together with split seems weird. Think about it: split excludes the chunks matched by the pattern from the resulting list, while your parentheses capture the values which match the pattern (\".*?\"(?=,))|(.*?(?=,))|(.*?(?=\n)) itself.

    The lookaheads are okay though.

      I guess it wasn't clear enough earlier, sorry. This is for a CSV file so I can nab the relevant data, convert it, and spit it out as a different, re-ordered CSV file for importing. I need to be able to convert between the two CSV data formats (accounting application and website), as the two main pieces of software are completely incompatible with each other. They can both import and export CSV, though, so I figured this would probably be the easiest, most flexible option (learning Perl aside).

      A few chunks of the CSV file (not complete lines, just representative of all the circumstances that could cause problems, with some extra data) are below:

      Bag10x8x24,Poly Bags 10x8x24 gussetted,1,FALSE,FALSE,,"Poly Bags 10x8x24 gussetted metallocene bags; Assoc. Bag # 264-4-64 (500/carton, 1 carton min)",0.00,NC,0,0.00,0.00,NC,0,0.00,0.00,NC
      Bag2.5x3zip,2.5x3x.004 zip lock bag,1,FALSE,FALSE,,"2-1/2 x 3x .004" zip lock bags with hang hole. Assoc Bag item #274-03H",0.00,NC,0,0.00,0.00,NC,0,0.00,0.00,PL1*0.6700000,0,0.00,0.00
      H06045-fullthd,"M6 x 45 hex cap scrw, full thd",1,FALSE,FALSE,"M6 x 45 hex cap scrw, full thd, class 8.8, zinc (C)","M6 x 45 hex cap scrw, full thd, class 8.8, zinc, Bossard article # 1049577",0.16,NC,0,0.00,0.00,NC,0,0.00,0.10,PL1*0.6300000,0,0.00,0.10

      There can effectively be any number of quotes or commas inside a quote-delimited field (though I'm not sure what the export does if a quote mark is followed by a comma in a description field... it hasn't happened before though, and it's not really a concern as it's easily enough avoided), and there can effectively be any number of quoted fields per line. There are also many blank (no data at all) fields, ALL of which have to be tracked and accounted for.

      As to the split, I noticed while reading through my Perl book that, when given parenthesis, split// returns the results of the matches (normally discarded) and the remaining data (normally retained). If nothing else, I figured it'd be handy for double-checking my regular expressions, as I could see what it dropped too.

      Thanks for the reply!
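
The split behaviour described in this reply (captured separators being returned along with the fields) can be seen in a minimal sketch. The input strings and array names here are hypothetical and chosen only to keep the example short.

```perl
use strict;
use warnings;

# Hypothetical one-line input.
my $line = 'a,b,,c';

# No capture group: separators are discarded.
my @plain = split /,/, $line;        # ('a', 'b', '', 'c')

# One capture group: each separator is returned between the fields.
my @kept = split /(,)/, $line;       # ('a', ',', 'b', ',', '', ',', 'c')

# Two alternated groups: every separator adds one slot per group, and
# the group that did not take part in the match yields undef.
my @alt = split /(,)|(;)/, 'a,b;c';  # ('a', ',', undef, 'b', undef, ';', 'c')

print scalar(@plain), " plain, ", scalar(@kept), " kept, ", scalar(@alt), " alt\n";
```

With three alternated groups, as in the pattern from the question, each separator would add three such slots, two of them undef. That is one likely source of the blank results mentioned earlier.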

Re: Perl is returning... odd results... from regular expressions. Things matching when they shouldn't, and stuff like that.
by bart (Canon) on Jan 11, 2007 at 13:00 UTC
    split /(\".*?\"(?=,))|(.*?(?=,))|(.*?(?=\n))/
    /(?=,)/ is a lookahead: it looks at the comma, but it doesn't include it in the match. That's why it's not at the end of the matched strings. Nowhere do you get rid of the commas, so they still appear at the start of the next match, which is a problem: your doublequoted strings will never be recognized after the first match, because of this comma!

    Also, I'm not convinced that split is the best approach here. Why not use //g?

    $_ = qq(item1,field2,more data,"a quoted, comma containing string"\n);
    @data = /(\".*?\"|.*?)[\n,]/g;
    $j = 0;
    printf "%d: %s\n", $j++, $_ for @data;
    Result:
    0: item1
    1: field2
    2: more data
    3: "a quoted, comma containing string"

    p.s. I didn't use this, but you're probably better off testing with defined than with a truth value to weed out unused captures.
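
To illustrate the p.s. above, here is a small sketch of why a truth test and a defined test give different results on CSV-like captures. The list `@captures` is a made-up example, with undef standing in for a capture group that did not participate in a match.

```perl
use strict;
use warnings;

# Hypothetical capture list; undef marks an unused capture group.
my @captures = ('Bag10x8x24', undef, '0', '', 'NC');

# Truth test: drops undef, but also the legitimate fields '0' and ''.
my @by_truth = grep { $_ } @captures;

# defined test: drops only the unused captures.
my @by_defined = grep { defined $_ } @captures;

print scalar(@by_truth), " kept by truth, ", scalar(@by_defined), " kept by defined\n";
```

Since the sample records contain many `0` and empty fields that all have to be tracked, the defined test is the safer filter under these assumptions.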

      Nowhere do you get rid of the commas, so they still appear at the start of the next match, which is a problem: your doublequoted strings will never be recognized after the first match, because of this comma!

      Actually, I was having that problem as well, and that explains it nicely. Thanks! I'll give your code a try as well.
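
The point about the lookahead leaving the comma in place can be sketched as follows. The string and variable names are assumptions for illustration; `$+[0]` is Perl's built-in end offset of the last successful match.

```perl
use strict;
use warnings;

my $str = 'item1,field2';

# Lookahead: a comma must follow, but it is not part of the match.
$str =~ /^(.*?)(?=,)/ or die "no match";
my $look_match = $1;
my $look_rest  = substr $str, $+[0];   # what is left after the match

# Consuming the comma moves the match end past it.
$str =~ /^(.*?),/ or die "no match";
my $eat_match = $1;
my $eat_rest  = substr $str, $+[0];

print "lookahead: '$look_match' then '$look_rest'\n";
print "consumed:  '$eat_match' then '$eat_rest'\n";
```

Both forms capture `item1`, but only the second leaves a remainder that starts at the next field rather than at the comma.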

Re: Perl is returning... odd results... from regular expressions. Things matching when they shouldn't, and stuff like that.
by Sagacity (Monk) on Jan 11, 2007 at 06:06 UTC

    First off, I agree with the other posts!

    If you are looking for a pattern that has first a quotation mark, then any characters to another quotation mark, an = sign, and finally a comma.

    The next pattern being the same as the first without the quotation marks

    And finally, any characters, an = sign, and the newline character, then try the re-write and see if you get better results and tweak it from there.

    The * and the ? next to each other are redundant especially after the wildcard . (which means any character), and the * meaning 0 or more of them.

    split /(\".*?\"(?=,))|(.*?(?=,))|(.*?(?=\n))/

    I think you are looking for something more like this. The second pattern and the first end up being redundant, so I removed the first pattern. Please note that it has been a long time since I have worked on this type of pattern matching, and I may have completely missed the mark.

    $some_value = split (/.*\=,|.*\=\n$/, $some_scalar);

      Actually, the "?" is useful. It makes it non-greedy. As this CSV file can have multiple strings, if it isn't included, it returns EVERYTHING between the first and last quote mark.

      As to the other question marks, like the (?=,) portion, those are lookaheads.

      I appreciate the reply, though! Thanks!
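
The greedy versus non-greedy difference mentioned in this reply can be shown with a short sketch. The input line is a hypothetical example with two quoted fields.

```perl
use strict;
use warnings;

# Hypothetical line with two quoted fields.
my $line = '1,"first, field",2,"second, field",3';

# Greedy: .* runs to the LAST quote mark on the line.
my ($greedy) = $line =~ /(".*")/;

# Non-greedy: .*? stops at the first closing quote, so /g finds each field.
my @lazy = $line =~ /(".*?")/g;

print "greedy: $greedy\n";
print "lazy:   $_\n" for @lazy;
```

The greedy form returns one string spanning both quoted fields, while the non-greedy form returns them separately.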

Re: Perl is returning... odd results... from regular expressions. Things matching when they shouldn't, and stuff like that.
by Melly (Chaplain) on Jan 11, 2007 at 10:27 UTC

    Well, if you don't want to use a module, you could try something like the following code - it basically breaks the job down into several parts. The only major requirement is that all your quotes should be valid pairs (you should probably add a test to check that you have an even number of quotes and that you have the number of fields per line that you expect).

    1. Pull out the quoted sections
    2. Replace ',' with '_comma_' in the quoted sections
    3. Restore the quoted sections back into position
    4. Safely split on ',' (since quoted commas are now '_comma_')
    5. Replace '_comma_' with ','

    Here's the code:

    use strict;
    my @output;

    while(<DATA>){
      chomp;
      next unless $_ =~ /\S/;

      # push the contents of any quoted sections (without the surrounding quotes)
      # onto an array... (we assume that all quotes are paired)
      push my @quoted, ($_ =~ /"([^"]*)"/g);

      # replace any commas in the array with '_comma_'
      foreach my $quote(@quoted){
        $quote =~ s/,/_comma_/g;
      }

      # now replace the ',' versions with the "_comma_" versions
      $_ =~ s/"[^"]*"/'"' . (shift @quoted) . '"'/ge;

      # now we can safely split on any commas (quoted commas are now '_comma_')
      push @output, [split /,/];

      # finally, replace any '_comma_' values with ',' in the latest element of output
      foreach(@{$output[$#output]}){
        s/_comma_/,/g;
      }
    }

    # what have we got?
    foreach(@output){
      foreach(@{$_}){
        print "$_:";
      }
      print "\n";
    }

    __DATA__
    123,456,"hello, world, goodbye, world",789
    123,456,"hello, world, goodbye, world",789,"foo, bar","bar, foo"
    "hello, world","goodbye, world",123,"foo"
    "hello"
    123,456,"goodbye, world",789
    map{$a=1-$_/10;map{$d=$a;$e=$b=$_/20-2;map{($d,$e)=(2*$d*$e+$a,$e**2 -$d**2+$b);$c=$d**2+$e**2>4?$d=8:_}1..50;print$c}0..59;print$/}0..20
    Tom Melly, pm@tomandlu.co.uk
Re: Perl is returning... odd results... from regular expressions. Things matching when they shouldn't, and stuff like that.
by Anonymous Monk on Jan 11, 2007 at 08:32 UTC
    Perl is returning... odd results... I'm learning perl,

    :) Makes me wonder what Dominus might say (:

    Of course it doesn't work! That's because you don't know what you are doing!

    Ah yes, and you are the first person to have noticed this bug since 1987. Sure.

    Yes, that's what it's supposed to do when you say that.

    Well, what did you expect?

    The bug is in you, not in Perl.

    So you threw in some random punctuation for no particular reason, and then you didn't get the result you expected. Hmmmm.