InfiniteSilence has asked for the wisdom of the Perl Monks concerning the following question:

After reading a merlyn article I tried to use the /g modifier in a regex in list context to populate an array with matches, like so:
#!/usr/bin/perl -w
use strict;
my $mm=qq(\"000.E+3\",\"\",\"\",\"\",\"QCA-086_2\",\"-1\",\"P\",\"FALSE\");
my @p = ($mm=~m/(\"[A-Za-z0-9_\-,.+]+\")/g);
print map {qq($_\n)} @p;
1;
And I get:
"000.E+3"
","
","
","
","
","
","
When I try and make the match optional:
#!/usr/bin/perl -w
use strict;
my $mm=qq(\"000.E+3\",\"\",\"\",\"\",\"QCA-086_2\",\"-1\",\"P\",\"FALSE\");
my @p = ($mm=~m/(\"(?:[A-Za-z0-9_\-,.+]+)?\")/g);
print map {qq($_\n)} @p;
1;
I get:
"000.E+3"
""
""
""
"QCA-086_2"
"-1"
"P"
"FALSE"
Which is what I wanted. I understand that in the first case the regex failed on the second element of $mm but shouldn't it have continued on to the end of the string? Second, where did the comma characters in the output come from?

Celebrate Intellectual Diversity

Re: regex match in list context
by flounder99 (Friar) on Oct 02, 2003 at 19:07 UTC
    What happened is that once the match failed at "", the engine skipped that opening quote and retried from the next character, which was the closing quote of the empty field. From there it matched "," (quote, comma, quote) and carried on in the same way. There are several ways to do what you want. If you don't have to worry about embedded escaped quotes, something simple like
    /("[^"]*?")/g
    Would work.
    If you have embedded escaped quotes you can use this:
    /("(?:\\"|[^"])*?")/g
    like this
    my $mm=qq(\"000.E+3\",\"\",\"\",\"\",\"QCA-086_2\",\"-1\",\"P\",\"FALSE\",\"this \\\"is\\\" quoted\");
    my @p = ($mm=~m/("(?:\\"|[^"])*?")/g);
    print $mm, "\n", map {qq($_\n)} @p;
    __OUTPUT__
    "000.E+3","","","","QCA-086_2","-1","P","FALSE","this \"is\" quoted"
    "000.E+3"
    ""
    ""
    ""
    "QCA-086_2"
    "-1"
    "P"
    "FALSE"
    "this \"is\" quoted"
    There are also several modules you can use, such as Text::Balanced and Regexp::Common.

    --

    flounder

•Re: regex match in list context
by merlyn (Sage) on Oct 02, 2003 at 19:01 UTC
Re: regex match in list context
by delirium (Chaplain) on Oct 02, 2003 at 18:59 UTC
    Your results are what the regex is supposed to do under both circumstances. In the first case, the "" match failed, so the regex pointer moved over one character and found "," which matched since comma is one of the things you're looking for. After that, the engine is going to keep finding matches with commas.
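A minimal sketch of that scan, assuming the sample string from the question. The helper name trace_matches is made up for illustration; it records the offset at which each match starts, using the original pattern unchanged.

```perl
use strict;
use warnings;

# Record where each match starts, using the original pattern
# (comma still inside the character class, + quantifier).
sub trace_matches {
    my ($str) = @_;
    my @trace;
    while ($str =~ /("[A-Za-z0-9_\-,.+]+")/g) {
        push @trace, [ $-[1], $1 ];   # $-[1] is the start offset of group 1
    }
    return @trace;
}

my $mm = '"000.E+3","","","","QCA-086_2","-1","P","FALSE"';
printf "offset %2d: %s\n", @$_ for trace_matches($mm);
```

The second match starts at offset 11, which is the closing quote of the first empty field, not its opening quote. Every later match is likewise built from the closing quote of one field, the comma, and the opening quote of the next.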

    I think you may have wanted to take the comma out of the [ ] and change + to * to keep empty fields. Like so:

    #!/usr/bin/perl -w
    use strict;
    my $mm=qq(\"000.E+3\",\"\",\"\",\"\",\"QCA-086_2\",\"-1\",\"P\",\"FALSE\");
    my @p = ($mm=~m/(\"[A-Za-z0-9_\-.+]*\")/g);
    print map {qq($_\n)} @p;
    1;

    And then again, there's always the handy split function.
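A rough sketch of the split approach, assuming that no field ever contains a comma (a comma inside quotes would be split as well). The name split_fields is hypothetical.

```perl
use strict;
use warnings;

# Split on every comma; the quotes stay attached to each field.
# Assumption: commas never appear inside a quoted field.
sub split_fields {
    my ($line) = @_;
    my @fields = split /,/, $line, -1;   # limit of -1 keeps trailing empty fields
    return @fields;
}

my $mm = '"000.E+3","","","","QCA-086_2","-1","P","FALSE"';
print "$_\n" for split_fields($mm);
```

For the sample line this yields eight fields, including the three empty "" ones. It does no validation at all, so it only suits input you already trust.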

Re: regex match in list context
by Aristotle (Chancellor) on Oct 02, 2003 at 20:45 UTC
    Short note first off:
    (?:[A-Za-z0-9_\-,.+]+)?
    is the same as
    [A-Za-z0-9_\-,.+]*

    except the former makes the matching engine work much harder.
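A quick way to check that the two forms return the same fields, sketched against the sample string from the question. The sub names are made up for illustration, and this compares results only; it says nothing about speed.

```perl
use strict;
use warnings;

# Optional non-capturing group around a + quantified class.
sub with_optional_group {
    my ($str) = @_;
    return $str =~ /("(?:[A-Za-z0-9_\-,.+]+)?")/g;
}

# The same class with a * quantifier.
sub with_star {
    my ($str) = @_;
    return $str =~ /("[A-Za-z0-9_\-,.+]*")/g;
}

my $mm = '"000.E+3","","","","QCA-086_2","-1","P","FALSE"';
my @a  = with_optional_group($mm);
my @b  = with_star($mm);
print "identical results\n" if "@a" eq "@b";
```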

    Another note: the whole point of using alternate delimiters is to not have to escape everything, so get rid of some of that backslashed eyesore. :)

    Anyway, let's get to the point I wanted to make: you might have a look at \G which forces a match to start where the last one left off. For the very first match, it is equivalent to \A.

    my $mm = qq("000.E+3","","","","QCA-086_2","-1","P","FALSE");
    # /x is required here: the pattern is spaced out for readability
    my @p = ( $mm =~ m[\G ("[A-Za-z0-9_\-,.+]*") (?: , | \z ) ]gx );
    print map {qq($_\n)} @p;
    1;
    This ensures that if any part of the line does not match your specification exactly, the matching engine will fail and abort, instead of trying to find something that looks like your specification somewhere further down the string. Coupled with some other check (for example, if you have a fixed number of fields, check that you matched that number of fields), it can be used to make sure that your script will die with a loud moan on invalid input, rather than go ahead silently to produce garbage output from garbage input.
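One possible shape for such a check, as a sketch rather than a definitive implementation. Instead of counting fields it verifies that the \G matches consumed the whole line. The sub name parse_strict is hypothetical, and it assumes Perl 5.10 or later for the // operator.

```perl
use strict;
use warnings;

# Walk the line field by field; /c keeps pos() where the last
# successful match ended, so it can be inspected after the loop.
sub parse_strict {
    my ($line) = @_;
    my @fields;
    while ($line =~ m{ \G ("[A-Za-z0-9_\-,.+]*") (?: , | \z ) }gcx) {
        push @fields, $1;
    }
    my $consumed = pos($line) // 0;
    die "invalid input at offset $consumed\n"
        unless $consumed == length $line;
    return @fields;
}

my @ok = parse_strict('"000.E+3","","","","QCA-086_2","-1","P","FALSE"');
print scalar(@ok), " fields\n";

# A malformed line makes the sub die instead of returning partial data.
my @bad = eval { parse_strict('"a",oops,"b"') };
print "rejected: $@" if $@;
```

Note that a line ending in a dangling comma would still pass, so a field-count check on top of this remains worthwhile when the number of fields is fixed.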

    Makeshifts last the longest.

      shorter note:
      [A-Za-z0-9_\-,.+]*
      is the same as
      [\w,.+-]*
      except the former makes my brain work much harder.
        Be aware that the latter is affected by the user's locale settings and/or the Unicode encoding of the strings. The former is not, and always means the same thing.
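A small sketch of the difference, using a Greek alpha as an arbitrary non-ASCII word character. The sub names are invented for the example.

```perl
use strict;
use warnings;

my $alpha = "\x{3b1}";   # GREEK SMALL LETTER ALPHA

# \w follows Unicode rules for a character string like this one.
sub matches_w        { return $_[0] =~ /^\w$/ ? 1 : 0 }

# The explicit class only ever matches the listed ASCII characters.
sub matches_explicit { return $_[0] =~ /^[A-Za-z0-9_]$/ ? 1 : 0 }

printf "\\w: %d, explicit class: %d\n",
    matches_w($alpha), matches_explicit($alpha);
```

Here \w accepts the alpha and the explicit class rejects it. On Perl 5.14 or later, the /a modifier restricts \w to ASCII if you want the short spelling with the strict meaning.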

        Makeshifts last the longest.