palandir has asked for the wisdom of the Perl Monks concerning the following question:

Hi.
I'm more of a newbie Perl programmer (1st post here also).
Today I (actually somebody else, but I'm posting this here now ;)) stumbled across some strange problem with regular expressions.

The code is simple:
$_ = '511111'; print if /^(\d)\d{\1}$/;
The first digit of the string in $_ indicates the number of digits following.
(For example, 233 should match, 42143 also, but 324 should not.)
However, the code above does not work.
We know how to work around this problem, the problem is NOT to find a solution for this. We simply want to know WHY this does not work.
It seems that this is not something trivial.


Furthermore, we were wondering:
If the regexp above would work, what if the first TWO or more digits would indicate how many digits follow? (meaning that >= 10 digits follow)
How could you stuff that into one regexp? (it's also just a theoretical question, but an interesting one IMHO).

Thanks in advance.

Replies are listed 'Best First'.
Re: RegExp madness
by tilly (Archbishop) on May 16, 2004 at 20:50 UTC
    Two answers.

    The simple one is that it doesn't work because \1 in a regular expression means (or tries to mean) "match whatever was captured in the first capturing parens right here". Where match means string match. And the attempt at specifying the count doesn't match the form of a repeated match, so the braces are interpolated literally. So your regular expression will match the string "5{5}".

    The theoretical reason that it can't work because of how the regular expression engine works. It has to compile the regular expression, and then run it. At compilation time it has to figure out things like \d{5} because that changes what the regular expression engine compiles the expression to, and it doesn't know then that \1 will be 5. At run-time it can figure out the meaning of \1, but at run-time it only knows how to interpolate \1 in, not how to recompile the regular expression on the fly.

Re: RegExp madness
by CombatSquirrel (Hermit) on May 16, 2004 at 21:14 UTC
    I wouldn't consider myself a Perl expert, so I won't comment on the others' replies, which certainly are correct. However, Perl (the newer versions anyways) supports dynamic RegEx creation, so you might want consider using the following to get the desired results (it does seem to work ;-):
    $_ = '511111'; print /^(\d)(??{"\\d{$1}"})$/ ? 'yay' : 'oops';
    Hope this helped.
    CombatSquirrel.
    Entropy is the tendency of everything going to hell.
Re: RegExp madness
by matija (Priest) on May 16, 2004 at 20:46 UTC
    In order to match the regexp, Perl construct a special kind of construct called an finite automata (I don't know enough about the innards to tell if it's a deterministic or a nondeterministic one, but it doesn't really matter).

    In order to construct it, it needs to know some things. A zero-or-more repeats is simple to construct, as is one-or more (they take very few nodes). X-repeats (where X is a known number) involve creating more nodes (on the order of X). But you have to know how many, or you can't construct the automata. And that's why it doesn't work..

    As for your second question, you would simply replace (\d) with (\d\d) - there is no reason why \1 needs to be just one character...

Re: RegExp madness
by haoess (Curate) on May 16, 2004 at 20:57 UTC

    Not everything must be done with a regexp:

    print if substr ($_, 0, 1) == length substr ($_, 1);

    -- Frank

Re: RegExp madness
by dave_the_m (Monsignor) on May 16, 2004 at 20:48 UTC
    print if /^(\d)\d{\1}$/; That would match if $_ equalled the string "71{7}" for example. Unless the {n,m} modifier has the correct syntax, it loses its special meaning and just becomes some more characters to match.
Re: RegExp madness
by Aristotle (Chancellor) on May 17, 2004 at 07:35 UTC

    CombatSquirrel's reply is one way to do it; note that it involves two separate regex compilations. Therefor, the way I'd prefer to do this would be

    print $1 if /^(\d)/g and /\G(\d{$1})/g

    Note the /g modifier on the first regex. Since it's called in scalar context, this makes the regex engine remember the end of the last match for that string; you can match that position in the next pattern using \G. This way, the second regex starts matching where the last one left off.

    If you're not looking for a match however, so much as for a way to handle data structures, then maybe unpack is your ticket.

    print unpack "A/A*", $_

    The A/ bit of the template tells unpack that you want to take one A, ie one ASCII character, as the length count to use for the next template. The A* will then grab that many ASCII characters from the rest of string. If you need to make sure they're digits, that will have to be an extra step. If the count is not a digit, the A* will grab zero characters. So maybe what you're doing is best done as

    my $num = unpack "A/A*", $_; print $num if length($num) and $num !~ /\D/;

    This is a little more awkward, but if the data structure you're dealing with is larger, it scales much better: you can put all of the data structure extraction into a single unpack template, and then validate the various bits you pulled out with individual patterns.

    The multi-regex solution on the other hand becomes messier as the structure becomes larger, because extraction and validation are intermixed.

    Makeshifts last the longest.

Re: RegExp madness
by ysth (Canon) on May 16, 2004 at 20:54 UTC
    Your {} can't take a backreference; that's not supported. So the {} doesn't look like a quantifier and the curly braces will be treated as literal characters to match. So: "34{3}"=~/^(\d)\d{\1}$/ succeeds instead of your desired "3444"=~/^(\d)\d{\1}$/.