Re: More Variable length regex issues

I too have often wished that capturing brackets inside a repeat group would capture to successively higher $n vars.

Actually, I wish that all the captures were made available via a magic array -- @^N seems a likely candidate given recent enhancements to the regex engine -- and that repeat group captures worked logically.

What you seem to want to do is to parse something like this with a regex:

a fixed bit: a,variable,length, repeated, bit [some more fixed stuff]
more fixed: more,variable,stuff [more fixed]

A repeat group lets you match this easily enough, but capturing all of the individual bits at the same time isn't possible: the capture only keeps whatever the last repetition matched. Which is a pain.
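To make that limitation concrete, here is a minimal sketch (the sample string and variable names are made up for illustration). A capture inside a repeated group only retains the text from its final iteration, whereas a global match in list context hands back every piece:

```perl
use strict;
use warnings;

# Hypothetical sample: the variable-length part of a record.
my $list = 'a,variable,length,bit';

# The group repeats four times, but the capture only holds the last iteration.
my ($last_only) = $list =~ m/^(?:(\w+),?)+$/;

# A global match in list context returns every capture instead.
my @all = $list =~ m/(\w+)/g;

print "repeat group kept: $last_only\n";
print "global match kept: @all\n";
```

This is why a second pass over the captured text is usually the path of least resistance.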

I think that probably the simplest (and probably most portable) way of doing this is to capture the variable bit to a single $n var on the first pass and break out the individual bits from there:

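while( my $data = <DATA> ) {
    # First pass: grab the fixed parts and the whole variable bit.
    next unless $data =~ m[^
        ( [\w\s]+ ) :
        ( [^\x5b]+ )  \x5b
        ( [^\x5d]+ )  \x5d
    ]x;
    my ($first_bit, $variable, $last_bit) = ( $1, $2, $3 );
    # Second pass: break the variable bit into its individual fields.
    my @variable_bits = $variable =~ m[(\w+)[,\s]]g;
    print "$first_bit: (@variable_bits) [$last_bit]\n";
}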

That said, if you are using Perl 5.8 or later (the code below relies on $^N, which was added in 5.8.0), then there is another way of doing this:

#! perl -slw
use strict;
use re 'eval';

our ($num, $firstwords, $bracketed, $label, @bits, $pre_bit, $in_bit, $post_bit);

my $re = qr[
    (?{
        our($num, $firstwords, $bracketed, $label, $pre_bit, $in_bit, $post_bit, @bits)
            = ( (undef) x 7, () );
    })
    (\d+) :                             (?{ our $num        = $^N })
    ([^\x5b]+?) \x5b                    (?{ our $firstwords = $^N })
    ([^\x5d]+?) \x5d                    (?{ our $bracketed  = $^N })
    ([^:]+) : \s*                       (?{ our $label      = $^N })
    (?x-ism: ( [^,\s]+? ) [,\s]         (?{ push our @bits,   $^N }) )+?
    \s* \x5b
        (\w+) \(                        (?{ our $pre_bit    = $^N })
        ([\w ]+) \)                     (?{ our $in_bit     = $^N })
        (\w+)                           (?{ our $post_bit   = $^N })
    \x5d
]x;

while( <DATA> ) {
    print "$num : $firstwords [ $bracketed ] $label : [@bits] [ $pre_bit ( $in_bit ) $post_bit ]"
        if $_ =~ $re;
}

__DATA__
1: or more [semi-fixed] fields: and,some,variable,length,stuff [more(fixed)stuff]
2: kind of [similarly] formated: records,with,variable,differences [embedded(in the)records]

I know the (?{ ... }) code-block feature is still labelled 'experimental', but I'd be surprised if it goes away. It seems really useful to me, but I doubt it has made it into many of the perl regex clones yet.

Whether this is worth the effort to avoid the second regex is doubtful for the simple instances shown, but on more complicated records, this ability to capture disparate and variable parts directly into named (even if global) vars has distinct advantages.

Note: My use of \x5b & \x5d isn't an affectation. There seems to be a bug in the regex engine (5.8 at least) that means that using m[ ( [^[]+ ) \[ ]x; or m[ ( [^]]+ ) \] ]x; (which I think ought to work) confuses the regex engine. This is true even if I escape the '[' and ']' within the character classes. Interestingly, it complains that the parens are unbalanced. I haven't tied down the exact circumstances yet, but if anyone else has encountered a similar problem I'd be interested in hearing from them.

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


Replies are listed 'Best First'.
Re: Re: More Variable length regex issues by japhy (Canon) on Jun 10, 2003 at 17:02 UTC
The reason for your bug (in your Note:) is that when Perl is FIRST parsing your code, and it tries to determine where your regex starts and ends, it only looks for balanced square brackets. At that stage, it's not actually parsing your regex, just looking for its start and end. Thus, the square brackets IN the regex that aren't backslashed throw the parser off.

_____________________________________________________
Jeff`[japhy]`Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
`s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`
Re: Re: Re: More Variable length regex issues by BrowserUk (Patriarch) on Jun 10, 2003 at 17:31 UTC
Thanks for that. So the answer is: don't use square brackets as delimiters if the regex contains (unbalanced) square brackets.

It's a shame that there aren't a couple more sets of balanced brackets in the arsenal :) Preferably a pair that could be used solely for quote-like delimiting. Maybe now we have Unicode, we could find a pairing that wouldn't get overloaded for 7 other things too? Some chance, I think :)
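A small sketch of that workaround, assuming a made-up sample line in the same shape as the earlier records: with braces as the delimiters, a lone square bracket inside a character class no longer interferes with finding the end of the pattern, so the \x5b and \x5d escapes are not needed.

```perl
use strict;
use warnings;

# Hypothetical record in the same shape as the earlier examples.
my $line = 'label: some,variable,stuff [fixed]';

# Braces delimit the pattern, so the lone '[' and ']' inside the
# character classes are not counted when Perl looks for the closing
# delimiter.
my ($before, $inside) = $line =~ m{
    ^ ( [^[]+ ) \[      # everything up to the opening bracket
      ( [^]]+ ) \]      # everything up to the closing bracket
}x;

print "before: '$before'\n";
print "inside: '$inside'\n";
```

The same caveat then applies to braces: a pattern delimited with m{ } has to keep any unescaped braces balanced.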
Re: Re: Re: Re: More Variable length regex issues by japhy (Canon) on Jun 10, 2003 at 20:41 UTC
Well, there are the French/German style quotes that look like << and >>... but Perl 6 uses them (believe it or not!). You can't use anything other than paren, bracket, brace, or angle-bracket in Perl right now to balance, since they're hardcoded into the parser/tokenizer. But you could certainly choose some obscure Unicode character that doesn't need balancing. Actually, if you could tell the parser to include a pair of your own, like ` and ', that'd be cool.