The scalar /\G..../gc is a powerful construct for inchworming through a string. Here's an example that extracts whitespace-delimited words from a string that may also contain single- or double-quoted phrases.
$_ = q{whatever "this" 'line is'};
my @elements;
push @elements, $1 while
    /\G\s*"(.*?)"/gc or
    /\G\s*'(.*?)'/gc or
    /\G\s*(\S+)/gc;
print map "<<$_>>", @elements;
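
To make the inchworm motion visible, here is a minimal sketch that records pos() before and after each successful match. The helper name trace_words is hypothetical (it is not part of the original snippet), and the `//` operator assumes perl 5.10 or later. It relies on /c leaving pos() untouched when an alternative fails, so the next alternative starts from the same spot.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns one
# [start, end, word] triple per extracted element.
sub trace_words {
    my ($text) = @_;
    my @steps;
    for ($text) {                        # alias $_ to our private copy
        while (1) {
            my $start = pos($_) // 0;    # where \G will anchor next
            /\G\s*"(.*?)"/gc
                or /\G\s*'(.*?)'/gc
                or /\G\s*(\S+)/gc
                or last;                 # all three failed: nothing left
            push @steps, [ $start, pos($_), $1 ];
        }
    }
    return @steps;
}

printf "%2d -> %2d  <<%s>>\n", @$_
    for trace_words(q{whatever "this" 'line is'});
```

Each triple shows the match starting exactly where the previous one stopped, which is the behaviour \G and /gc provide together.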

Re: Extract potentially quoted words
by VSarkiss (Monsignor) on Jun 07, 2001 at 00:15 UTC
    Actually, what I learned from this is another lesson in precedence. I would've thought to write it as:
    push @elements, $1 while /\G\s*"(.*?)"/gc || /\G\s*'(.*?)'/gc || /\G\s*(\S+)/gc;
    I usually treat or as lower precedence than anything -- I use it almost exclusively as do_something() or die "Help!". After going through the perlop and perlsyn man pages, I think I see now: the ors are part of the expression that follows the while modifier, not part of the full statement.

    Sooo.... In this case, the || works like the or. You could make an "or die" type of thing by parenthesizing the while expression:

    push @elements, $1 while (/\G\s*"(.*?)"/gc or /\G\s*'(.*?)'/gc or /\G\s*(\S+)/gc) or die "Argh";
    Although this is stupid since the while will return a false value at some point (you hope!), so you'll always die. But if you changed it to an and, you could detect if the loop never executed. (Hmmm... Potentially useful trick.)

      Although in Perl, most statements can also be used as expressions, there are a few exceptions. The while above is a statement modifier and cannot be used in an expression.

      For example:

      push @elements, $1 while (/\G\s*"(.*?)"/gc or /\G\s*'(.*?)'/gc or /\G\s*(\S+)/gc) or die "Argh";
      is still parsed as:
      push @elements, $1 while( (/\G\s*"(.*?)"/gc or /\G\s*'(.*?)'/gc or /\G\s*(\S+)/gc) or die "Argh" );
      and if we try to force your desired interpretation:
      (push @elements, $1 while /\G\s*"(.*?)"/gc or /\G\s*'(.*?)'/gc or /\G\s*(\S+)/gc) or die "Argh";
      we get: syntax error at line 1, near "$1  while"

              - tye (but my friends call me "Tye")
        No, there are no statements that can be used as expressions. What were you thinking? There's a clear delineation in Perl between statement things and expression things, and never the twain shall meet.

        -- Randal L. Schwartz, Perl hacker

        Aha! I knew your last example wouldn't work because I tried it. But I misinterpreted how the code I posted was working.

        Thanks for the correction, I appreciate it.
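
The upshot of the exchange above can be sketched as follows: inside the condition of a while modifier, `||` and `or` behave the same, and a "did the loop run at all?" check has to live in a separate statement. The sub names extract_words and words_or_die are hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the loop from the original post.
sub extract_words {
    my ($text) = @_;
    my @elements;
    for ($text) {    # alias $_ so the bare matches and pos() apply to $text
        # '||' and 'or' are interchangeable here: the whole chain is the
        # condition of the 'while' modifier either way.
        push @elements, $1
            while /\G\s*"(.*?)"/gc
               || /\G\s*'(.*?)'/gc
               || /\G\s*(\S+)/gc;
    }
    return @elements;
}

# The emptiness check goes in its own statement. A list assignment in
# scalar context yields the number of elements on its right-hand side,
# so 'or' fires only when nothing was extracted.
sub words_or_die {
    my ($text) = @_;
    my @words = extract_words($text)
        or die "no words found\n";
    return @words;
}

print join(', ', words_or_die(q{whatever "this" 'line is'})), "\n";
print "empty input: $@" unless eval { words_or_die('   '); 1 };
```

Because the modifier cannot appear inside an expression, testing the collected list afterwards is one straightforward way to get the effect VSarkiss was after.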

(bbfu) (dot star) Re: Extract potentially quoted words
by bbfu (Curate) on Jun 07, 2001 at 03:57 UTC

    Hrm. You might consider using [^"]* and [^']* instead of .*?.

    Just a thought. ;-)

    bbfu
    Seasons don't fear The Reaper.
    Nor do the wind, the sun, and the rain.
    We can be like they are.

      I might consider it, but what would the point be? Both walk the minimal chars to get to the result. Perhaps there'll be an ever-so-slight speed improvement. Now, if it had been .* instead of .*?, I'd see that.

      -- Randal L. Schwartz, Perl hacker

        I would say that using a negated character class is more efficient than using minimal matching:
                        Rate minimal_c neg_class_c minimal neg_class
        minimal_c    94887/s        --        -11%    -40%      -44%
        neg_class_c 106974/s       13%          --    -32%      -37%
        minimal     157452/s       66%         47%      --       -8%
        neg_class   170558/s       80%         59%      8%        --
        minimal_c and neg_class_c use capturing parentheses; minimal and neg_class don't. Either way there's a small but noticeable advantage for the negated character class.

        The reason for this isn't too hard to figure out. With a negated character class, the regex engine does almost no backtracking. The first thing it tries to match is [^"]*, and it succeeds every time until it finds the ".

        With minimal matching, however, the engine backtracks after every character. The first thing it tries to match is ", and it fails every time, then backs up and lets .*? consume one more character, until it finds the ". It's doing more work that way, so it's slower.

        #!perl -w
        use strict;
        use Benchmark;
        # cmpthese is only imported when $^V is set (perl 5.6+)
        Benchmark->import(qw/cmpthese/) if $^V;

        my $time = shift || 10;      # CPU seconds per benchmark
        my $len  = shift || 1000;    # repetitions of 'abc' in each segment
        my $abc  = 'abc' x $len;
        # Double-quoted (qq) so that $abc interpolates into the target string
        my $str  = qq{$abc"$abc"$abc};

        my %bms = (
            minimal     => sub { $str =~ /".*?"/ },
            neg_class   => sub { $str =~ /"[^\"]*"/ },
            minimal_c   => sub { $str =~ /"(.*?)"/ },
            neg_class_c => sub { $str =~ /"([^\"]*)"/ },
        );

        if ($^V) {
            cmpthese(-$time, \%bms);
        } else {
            timethese(-$time, \%bms);
        }

        *shrug* The latter is more efficient (if only by a little, as you point out) and (to me, anyway) a little more clear, conceptually. And there's no real reason not to, besides personal preference. My preference is, why make the RE do more than it needs to? :-)

        Anyway, it was just a thought. It is a useful snippet, btw. :-)

        bbfu
        Seasons don't fear The Reaper.
        Nor do the wind, the sun, and the rain.
        We can be like they are.

        Negated character classes match the newline. Dot doesn't (unless the /s modifier is in effect).
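
The practical difference shows up when a quoted phrase spans a line break. The following is a small sketch with an invented sample string; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = qq{say "two\nlines" now};    # the quoted phrase spans a newline

my ($lazy)    = $text =~ /"(.*?)"/;     # dot will not cross "\n"
my ($lazy_s)  = $text =~ /"(.*?)"/s;    # /s lets dot match "\n"
my ($negated) = $text =~ /"([^"]*)"/;   # [^"] matches "\n" without /s

printf "%-8s %s\n", $_->[0], defined $_->[1] ? "matched" : "no match"
    for [ '.*?' => $lazy ], [ '.*? /s' => $lazy_s ], [ '[^"]*' => $negated ];
```

So swapping .*? for [^"]* is not purely a speed choice: it also changes what the extraction snippet does with multi-line input, unless /s is added to the dot version.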