oha has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

some time ago a monk asked in CB about parsing a CSV-like stream of data without installing any modules.
I played with it a bit as soon as I had some time at the office, and I found out that using m//g in scalar context is one way to do it (it will not fail on wrong data, but that's not the point).

$_ = "'foo',123,'bar\\'cuz', 'comma,comma',,'void'";
while(/('(.*?)'|([^']*?))($|,\s*)/g) {
    my $t = $2 || $3;
    print ">$t\n";
}
___________
>foo
>123
>bar\'cuz
>comma,comma
>
>void
>
(Note: not sure if this helps, but I get an extra match at the end.)
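A possible explanation for the extra match, with a sketch of one way around it: after 'void' has been consumed the match position sits at the end of the string, and there the pattern can still match an empty unquoted field followed by `$`. The sub below (the name `csv_fields` is made up for illustration) keeps the pattern from the question and simply leaves the loop once the separator group matched the end of the string rather than a comma. It also uses `defined $2` in place of `$2 || $3`, on the assumption that a quoted '0' should be kept as a field.

```perl
use strict;
use warnings;

# Hypothetical helper: same pattern as in the question, but the loop
# stops as soon as the separator group ($4) is empty, which can only
# happen when "$" matched instead of a comma.
sub csv_fields {
    my ($str) = @_;
    my @fields;
    while ($str =~ /('(.*?)'|([^']*?))($|,\s*)/g) {
        # $2 is set for quoted fields, $3 for unquoted (possibly empty) ones
        push @fields, defined $2 ? $2 : $3;
        last if $4 eq '';    # reached the end: do not try another match
    }
    return @fields;
}

print ">$_\n" for csv_fields("'foo',123,'bar\\'cuz', 'comma,comma',,'void'");
```

A trailing comma still yields a final empty field, because that empty field is pushed before the loop exits.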

Then I tried doing the same in list context and got weird behaviour that I was not able to understand:

$_ = "'foo',123,'bar\\'cuz', 'comma,comma',,'void'";
map {
    my $t = $2 || $3;
    print "<$t\n";
} m/('(.*?)'|([^']*?))($|,\s*)/g;
______________
<'foo'
<foo
<
<,
<123
<
<123
<,
<'bar\'cuz'
<bar\'cuz
<
<,
<'comma,comma'
<comma,comma
<
<,
<
<
<
<,
<'void'
<void
<
<
<
<
<
<
Note: I confess I didn't correctly understand how \G should be used with /g, but I tried using it at the start of the RE and got the same result.
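For what it's worth, here is a small sketch of what \G changes. It anchors each attempt at the position where the previous /g match ended, so the regex engine may not skip ahead over text it cannot match. The input string and both sub names are invented for the illustration and are not from the thread. With a pattern that always matches contiguously, like the one above, adding \G makes no visible difference, which would be consistent with getting the same result.

```perl
use strict;
use warnings;

# Without \G the engine is free to skip over "x," and carry on.
sub loose_numbers {
    my ($str) = @_;
    my @n = $str =~ /(\d+),?/g;
    return @n;
}

# With \G every match must start exactly where the previous one ended,
# so scanning stops at the first thing that is not a number.
sub anchored_numbers {
    my ($str) = @_;
    my @n = $str =~ /\G(\d+),?/g;
    return @n;
}

my $input = "12,34,x,56";    # made-up data with one bad field
print "loose:    @{[ loose_numbers($input) ]}\n";       # loose:    12 34 56
print "anchored: @{[ anchored_numbers($input) ]}\n";    # anchored: 12 34
```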

Oha

Updated: forgive me, I realized just now that I am getting not the matches but all the groups. This explains why I get four times as many results, but I still do not understand how $2 and $3 are passed.

Re: m//g in list and scalar context differences?
by Anno (Deacon) on Sep 19, 2007 at 11:11 UTC
    map { my $t = $2 || $3; print "<$t\n"; } m/('(.*?)'|([^']*?))($|,\s*)/g;
    In list context, a global match that contains capturing parentheses returns a flattened list of all captured parts (four per match in your example, one for each pair of capturing parentheses) for all matches. That is the list you're mapping over, and it is built before the map even starts. The capture variables ($2 and $3) you are referring to in the map block are the ones left over from the last of possibly many global matches.

    In other words, map will run over a list four times as long as the number of global matches, but $2 and $3 will be the same during all iterations. That won't do at all what you expect.
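To make the difference concrete, here is a minimal sketch using the pattern and data from the question. The three sub names are made up for the illustration. The first returns the flat list that list context produces, the second walks the matches one at a time in scalar context, and the third recovers the same fields from the flat list by consuming it one match's worth of captures at a time.

```perl
use strict;
use warnings;

my $re   = qr/('(.*?)'|([^']*?))($|,\s*)/;    # four capturing groups
my $data = "'foo',123,'bar\\'cuz', 'comma,comma',,'void'";

# List context: all groups of all matches, flattened into one list.
sub flat_captures {
    my ($str) = @_;
    my @all = $str =~ /$re/g;
    return @all;
}

# Scalar context: one match per iteration, so $2 and $3 are current.
sub fields_by_loop {
    my ($str) = @_;
    my @fields;
    while ($str =~ /$re/g) {
        push @fields, defined $2 ? $2 : $3;
    }
    return @fields;
}

# List context again, but reading the flat list four items at a time.
sub fields_by_chunks {
    my ($str) = @_;
    my @all = flat_captures($str);
    my @fields;
    while (my ($whole, $quoted, $bare, $sep) = splice @all, 0, 4) {
        push @fields, defined $quoted ? $quoted : $bare;
    }
    return @fields;
}

printf "%d captures, %d fields\n",
    scalar(flat_captures($data)), scalar(fields_by_loop($data));
```

Groups that did not take part in a match come back as undef in the flat list, which is why the chunked version tests `defined $quoted` rather than relying on truthiness.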

    Anno

Re: m//g in list and scalar context differences?
by NetWallah (Canon) on Sep 19, 2007 at 20:18 UTC
    This should get you close to what you want..
    >perl -e "my $x = qq['foo',123,'bar\\'cuz', 'comma,comma',,'void']; map { print qq[<$_;\n]; } $x=~m/((?:'[^']*'|[^'\s,]+|))(?:[$|,\s]|$)/g;"
    <'foo';
    <123;
    <'cuz';
    <;
    <'comma,comma';
    <;
    <'void';
    <;
    There is an extra empty item returned. No easy way to avoid that, that I can think of.
    'bar\'cuz' is an invalid text string, in this context, so only 'cuz' is returned.

    Note the use of non-capturing parens "(?:" - quite useful in this context, so each iteration returns only a single capture, because the first paren IS a capturing paren.
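As a rough illustration of that point, the sketch below compares a pattern in which every group captures with one in which only the outer group does. Both patterns are simplified stand-ins written for this example, not the one-liner above, and the sample line and sub names are invented. The last sub shows one possible way to drop the trailing empty item, on the assumption that it is only meaningful when the input ends with a comma.

```perl
use strict;
use warnings;

# Every pair of parentheses captures: four items per match.
sub all_groups {
    my ($line) = @_;
    my @items = $line =~ /('([^']*)'|([^',]*))(,|$)/g;
    return @items;
}

# Only the outer pair captures; (?:...) groups without capturing,
# so list context returns exactly one item per match.
sub one_per_match {
    my ($line) = @_;
    my @items = $line =~ /('[^']*'|[^',]*)(?:,|$)/g;
    return @items;
}

# Assumption: a final empty item is noise unless the line ends in a comma.
sub without_trailing_empty {
    my ($line) = @_;
    my @items = one_per_match($line);
    pop @items if @items && $items[-1] eq '' && $line !~ /,\s*$/;
    return @items;
}

my $line = "'foo',123,'bar',,'void'";    # made-up sample line
print scalar(all_groups($line)), " vs ", scalar(one_per_match($line)), "\n";
print "<$_;\n" for without_trailing_empty($line);
```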

         "As you get older three things happen. The first is your memory goes, and I can't remember the other two... " - Sir Norman Wisdom