Hello monks,

I've recently been doing a large amount of parsing and have wandered into a bit of a quandary related to regexes, \G, and /g.

My recent code uses state-based lexing/parsing for files that range from relatively simple to moderately complex formats. I slurp the file into a string and then walk through that string with a set of token regexes, using \G and /gc. I've run into an issue with /g that I'd like advice on.

Here's a simple example:

use strict;
use warnings;

my $data = <<'END';
x = 10;
y = 12;
z = 100;
END

sub parse {
    my $data_ref = shift;

    my $rgx = qr/(\w+)\s*=\s*(\d+)\s*;\s*/;
    my @m = $$data_ref =~ /\G$rgx/gc;
    return \@m;
}

while (my @matches = @{parse(\$data)}) {
    while (my $var = shift @matches) {
        my $val = shift @matches;
        print "I got variable ($var) set to value ($val)\n";
    }

    print "Trying next parse...\n";
}

The output is:

I got variable (x) set to value (10)
I got variable (y) set to value (12)
I got variable (z) set to value (100)
Trying next parse...

I would *like* the output to be:

I got variable (x) set to value (10)
Trying next parse...
I got variable (y) set to value (12)
Trying next parse...
I got variable (z) set to value (100)
Trying next parse...

Ideally, I'd like to handle these declarations one at a time: get x, get 10, store the data, then return to my string for more parsing. However, when the data repeats in a regular pattern, the /g match in parse() returns every consecutive match in a single call.

/g seems to have two distinct functions: 1) retain the match position after a match and 2) allow multiple matches to occur. I'd like to retain the match position without the secondary effect of allowing multiple matches. We have the /gc modifier, which retains the match position after a failed match. From my reading of the documentation, there is no similar modifier that retains the match position on a successful match without /g and its additional functionality. The pos function also seems to work only if /g is used.

So my question is: how do I keep single-token matching while still getting /g's position tracking in the string? Note that I'm specifically trying to avoid cutting up the string itself, as string manipulation greatly slows the parsing.
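
One direction that may be worth trying, sketched here under assumptions rather than tested against a real parser: perlop documents that m//g in scalar context finds only the next match on each evaluation, while in list context it returns all of them. The assignment to @m in parse() puts the match in list context, which would explain the all-at-once behavior. The helper name parse_one and the @log array below are hypothetical, introduced only for this illustration.

```perl
use strict;
use warnings;

my $data = <<'END';
x = 10;
y = 12;
z = 100;
END

# Hypothetical helper: match at most one declaration per call.
sub parse_one {
    my $data_ref = shift;

    my $rgx = qr/(\w+)\s*=\s*(\d+)\s*;\s*/;

    # The match is evaluated in scalar (boolean) context here, so /g
    # makes a single attempt starting at pos($$data_ref). The /c flag
    # leaves pos() where it is if that attempt fails.
    if ($$data_ref =~ /\G$rgx/gc) {
        my ($var, $val) = ($1, $2);
        return ($var, $val);
    }

    return;    # empty list: nothing matched at the current position
}

my @log;    # collected only so the result is easy to inspect
while (my ($var, $val) = parse_one(\$data)) {
    push @log, "I got variable ($var) set to value ($val)";
    push @log, "Trying next parse...";
}

print "$_\n" for @log;
```

If this reading of the documentation is right, the string is never modified or copied. Only pos($data) advances, so other token regexes could be tried at the same position with further \G.../gc matches in scalar context.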

In reply to Perl regex \G /g and /gc by tj_thompson
