RegExp madness

palandir has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: RegExp madness by tilly (Archbishop) on May 16, 2004 at 20:50 UTC
Two answers. The simple one is that it doesn't work because \1 in a regular expression means (or tries to mean) "match whatever was captured in the first capturing parens right here". Where match means string match. And the attempt at specifying the count doesn't match the form of a repeated match, so the braces are interpolated literally. So your regular expression will match the string "5{5}". The theoretical reason that it can't work because of how the regular expression engine works. It has to compile the regular expression, and then run it. At compilation time it has to figure out things like \d{5} because that changes what the regular expression engine compiles the expression to, and it doesn't know then that \1 will be 5. At run-time it can figure out the meaning of \1, but at run-time it only knows how to interpolate \1 in, not how to recompile the regular expression on the fly.	[reply]
Re: RegExp madness by CombatSquirrel (Hermit) on May 16, 2004 at 21:14 UTC
I wouldn't consider myself a Perl expert, so I won't comment on the others' replies, which certainly are correct. However, Perl (the newer versions anyways) supports dynamic RegEx creation, so you might want consider using the following to get the desired results (it does seem to work ;-): `$_ = '511111'; print /^(\d)(??{"\\d{$1}"})$/ ? 'yay' : 'oops';` [download] Hope this helped. CombatSquirrel. Entropy is the tendency of everything going to hell.	[reply] [d/l]
Re: RegExp madness by matija (Priest) on May 16, 2004 at 20:46 UTC
In order to match the regexp, Perl construct a special kind of construct called an finite automata (I don't know enough about the innards to tell if it's a deterministic or a nondeterministic one, but it doesn't really matter). In order to construct it, it needs to know some things. A zero-or-more repeats is simple to construct, as is one-or more (they take very few nodes). X-repeats (where X is a known number) involve creating more nodes (on the order of X). But you have to know how many, or you can't construct the automata. And that's why it doesn't work.. As for your second question, you would simply replace `(\d)` with `(\d\d)` - there is no reason why `\1` needs to be just one character...	[reply]
Re: RegExp madness by haoess (Curate) on May 16, 2004 at 20:57 UTC
Not everything must be done with a regexp: `print if substr ($_, 0, 1) == length substr ($_, 1);` [download] -- Frank	[reply] [d/l]
Re: RegExp madness by dave_the_m (Monsignor) on May 16, 2004 at 20:48 UTC
`print if /^(\d)\d{\1}$/;` That would match if $_ equalled the string `"71{7}"` for example. Unless the `{n,m}` modifier has the correct syntax, it loses its special meaning and just becomes some more characters to match.	[reply] [d/l] [select]
Re: RegExp madness by Aristotle (Chancellor) on May 17, 2004 at 07:35 UTC
CombatSquirrel's reply is one way to do it; note that it involves two separate regex compilations. Therefor, the way I'd prefer to do this would be `print $1 if /^(\d)/g and /\G(\d{$1})/g` [download] Note the `/g` modifier on the first regex. Since it's called in scalar context, this makes the regex engine remember the end of the last match for that string; you can match that position in the next pattern using `\G`. This way, the second regex starts matching where the last one left off. If you're not looking for a match however, so much as for a way to handle data structures, then maybe unpack is your ticket. `print unpack "A/A", $_` [download] The `A/` bit of the template tells unpack that you want to take one `A`, ie one ASCII character, as the length count to use for the next template. The `A` will then grab that many ASCII characters from the rest of string. If you need to make sure they're digits, that will have to be an extra step. If the count is not a digit, the `A` will grab zero characters. So maybe what you're doing is best done as `my $num = unpack "A/A", $_; print $num if length($num) and $num !~ /\D/;` [download] This is a little more awkward, but if the data structure you're dealing with is larger, it scales much better: you can put all of the data structure extraction into a single unpack template, and then validate the various bits you pulled out with individual patterns. The multi-regex solution on the other hand becomes messier as the structure becomes larger, because extraction and validation are intermixed. Makeshifts last the longest.	[reply] [d/l] [select]
Re: RegExp madness by ysth (Canon) on May 16, 2004 at 20:54 UTC
Your {} can't take a backreference; that's not supported. So the {} doesn't look like a quantifier and the curly braces will be treated as literal characters to match. So: "34{3}"=~/^(\d)\d{\1}$/ succeeds instead of your desired "3444"=~/^(\d)\d{\1}$/.	[reply]