tj_thompson has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

I've recently been doing a large amount of parsing and have wandered into a bit of a quandary related to regexes, \G, and /g.

My recent code uses state-based lexing/parsing for files that range from relatively simple to moderately complex formats. My approach is to slurp the file in and walk the resulting string with a number of regex tokens, anchored with \G and matched with /gc. I've run into an issue with /g that I'd like to get advice on.

Here's a simple example:

use strict;
use warnings;

my $data =<<'END';
x = 10;
y = 12;
z = 100;
END

sub parse {
    my $data_ref = shift;
    my $rgx = qr/(\w+)\s*=\s*(\d+)\s*;\s*/;
    my @m = $$data_ref =~ /\G$rgx/gc;
    return \@m;
}

while (my @matches = @{parse(\$data)}) {
    while (my $var = shift @matches) {
        my $val = shift @matches;
        print "I got variable ($var) set to value ($val)\n";
    }
    print "Trying next parse...\n";
}

The output is:

I got variable (x) set to value (10)
I got variable (y) set to value (12)
I got variable (z) set to value (100)
Trying next parse...

I would *like* the output to be:

I got variable (x) set to value (10)
Trying next parse...
I got variable (y) set to value (12)
Trying next parse...
I got variable (z) set to value (100)
Trying next parse...

Ideally, I'd like to be able to handle these declarations one at a time. Get x, get 10, handle storing the data, then return to my string for parsing. However, in data formatted in a regular repeating fashion, the /g modifier results in multiple matches.

/g seems to have two distinct functions: 1) ensure the match position is retained after a match and 2) allow multiple matches to occur. I'd like to be able to retain the position of the match without the secondary effect of allowing multiple matches. We have the /gc modifier, which retains the match position after a failed match. My reading of the documentation suggests there is no similar modifier that only retains the match position on a successful match, outside of /g and its additional functionality. The pos function also only seems to work if /g is used.
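For reference, the position tracking described above can be observed directly with pos(). This is a minimal sketch on a made-up string; the variable names are illustrative and not from my real parser. It only shows what happens to pos() after a successful match, after a failed /g match, and after a failed /gc match.

```perl
use strict;
use warnings;

my $str = "aaa bbb";    # hypothetical sample input

# Scalar context: /g matches once and records where that match ended.
my $ok = $str =~ /\G(\w+)\s*/g;
my $after_first = pos($str);      # just past "aaa "

# A failed match with /g alone resets pos() to undef ...
$ok = $str =~ /\G\d+/g;
my $after_fail_g = pos($str);

# ... whereas a failed match with /gc leaves pos() where it was.
pos($str) = $after_first;         # pos() is assignable
$ok = $str =~ /\G\d+/gc;
my $after_fail_gc = pos($str);

printf "after first match: %s\n", $after_first;
printf "after failed /g:   %s\n", defined $after_fail_g ? $after_fail_g : 'undef';
printf "after failed /gc:  %s\n", $after_fail_gc;
```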

So my question: how do I retain both single-token matching capability and /g's position tracking in the string? Note that I'm particularly trying to avoid cutting the string itself up, as string manipulation greatly slows the parsing.

Re: Perl regex \G /g and /gc
by ikegami (Patriarch) on Sep 17, 2014 at 02:23 UTC
    //g in list context finds all (the remaining) matches. Use it in scalar (or void) context to find the next match.
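Applied to the code in the question, that advice might look like the sketch below. The sub name parse_one and the @out array are hypothetical; the regex and data are the ones from the original post. The match is the condition of an if, which is scalar context, so each call consumes exactly one declaration and leaves pos() just past it.

```perl
use strict;
use warnings;

my $data = <<'END';
x = 10;
y = 12;
z = 100;
END

# Return the next (name, value) pair, or an empty list when nothing
# matches at the current position. parse_one is an illustrative name.
sub parse_one {
    my ($data_ref) = @_;
    my $rgx = qr/(\w+)\s*=\s*(\d+)\s*;\s*/;
    if ( $$data_ref =~ /\G$rgx/gc ) {    # scalar context: one match per call
        return ( $1, $2 );
    }
    return;
}

my @out;
while ( my ( $var, $val ) = parse_one( \$data ) ) {
    push @out, "I got variable ($var) set to value ($val)";
    push @out, "Trying next parse...";
}
print "$_\n" for @out;
```

The while condition is a list assignment evaluated in scalar context, so it is false once parse_one returns an empty list.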
      Exactly the information I was looking for. For some reason context didn't cross my mind. Thanks as always Ikegami :)
      Context! I've just spent hours trying to figure out why a /\G/g wasn't doing what I wanted; finally stumbled across this. Context! I may have overlooked it but I think the docs fail to spell that behavior out clearly. Or anyway, not blatantly enough for me to catch on.
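One way list context can sneak in unnoticed is a parenthesized my on the left of the assignment. The following is only a sketch on a made-up string, with illustrative variable names, comparing where pos() ends up in each case.

```perl
use strict;
use warnings;

my $s = "one two three";    # hypothetical sample input

# Parentheses around "my ($first)" make this a list assignment, so the
# match runs in list context and /g consumes every remaining token.
my ($first) = $s =~ /\G(\w+)\s*/gc;
my $pos_list = pos($s);          # end of the string, not end of "one "

# Assigning to a plain scalar gives scalar context: one match per call.
pos($s) = 0;                     # rewind for the comparison
my $ok         = $s =~ /\G(\w+)\s*/gc;
my $token      = $1;
my $pos_scalar = pos($s);        # just past "one "

print "list context:   first=$first pos=$pos_list\n";
print "scalar context: token=$token pos=$pos_scalar\n";
```

Using the match as the condition of an if or while, or wrapping it in scalar(), gives the same one-match-at-a-time behavior as the plain scalar assignment.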
Re: Perl regex \G /g and /gc
by Anonymous Monk on Sep 17, 2014 at 00:12 UTC

    see Re: Count Quoted Words, Re: Help required in find command. (read parse file tokenize m//gc), Re^2: Help with regular expression ( m/\G/gc ), perlfaq6#What good is \G in a regular expression? , Re^2: POD style regex for inline HTML elements

    #!/usr/bin/perl --
    use strict;
    use warnings;
    use Data::Dump qw/ dd /;

    sub TRACE;
    sub DEBUG;
    *TRACE = *DEBUG = sub { print STDERR @_,"\n" };

    my $data =<<'END';
    # yo
    x = 10;
    y = 12;
    z = 100;
    junk
    END

    my $matches = dadata( \$data );
    dd( $matches );
    while( @$matches ){
        my $ma = shift @$matches ;
        dd( $ma );
    }
    exit( 0 );

    sub dadata {
        my( $dataref ) = @_;
        my @matches;
        pos( $$dataref ) = 0;
        while( length( $$dataref ) > pos( $$dataref ) ){
            $$dataref =~ m{\G^(#.*$)}gcm and do {
                push @matches, [ "COMMENT", $1 ];
                TRACE "# COMMENT $1";
                next;
            };;
            $$dataref =~ m{\G^\s*(\w+)\s*=\s*(\d+)\s*;\s*$}gcmx and do {
                push @matches, [ "KV", $1 , $2 ];
                TRACE "# K($1)=V($2)";
                next;
            };;
            $$dataref =~ m{\G(\s+)}gcxs and do {
                push @matches, [ "SPACE", $1 ];
                next;
            };;
            $$dataref =~ m{\G(\S)}gcxs and do {
                push @matches, [ "INCH", $1 ];
                TRACE "# INCH($1)";
                next;
            };;
        }
        return \@matches;
    }
    __END__
    # COMMENT # yo
    # K(x)=V(10)
    # K(y)=V(12)
    # K(z)=V(100)
    # INCH(j)
    # INCH(u)
    # INCH(n)
    # INCH(k)
    [
      ["COMMENT", "# yo"],
      ["SPACE", "\n"],
      ["KV", "x", 10],
      ["SPACE", "\n"],
      ["KV", "y", 12],
      ["SPACE", "\n"],
      ["KV", "z", 100],
      ["SPACE", "\n"],
      ["INCH", "j"],
      ["INCH", "u"],
      ["INCH", "n"],
      ["INCH", "k"],
      ["SPACE", "\n"],
    ]
    ["COMMENT", "# yo"]
    ["SPACE", "\n"]
    ["KV", "x", 10]
    ["SPACE", "\n"]
    ["KV", "y", 12]
    ["SPACE", "\n"]
    ["KV", "z", 100]
    ["SPACE", "\n"]
    ["INCH", "j"]
    ["INCH", "u"]
    ["INCH", "n"]
    ["INCH", "k"]
    ["SPACE", "\n"]