Monks,
I came across this regex in my compilers course. It
isn't particularly advanced, in terms of Perl's regex
capabilities, but it's complex enough to give pause to most
of the students in this (fourthyear undergraduate) course.
Spend a few minutes figuring out, for instance, the ten
shortest strings that it'll match:
/^(01(01*0)*1)*$/
Bonus points if you notice a pattern in the set of
strings that it matches....
Update: Silly me. In the context of this course,
regexes describe languages, they don't match strings. In
that context, anchors aren't useful. (Finite
Automata match strings, and you can build FAs from regexes
automatically.) But I brought this example over to a Perl
context, and played with the terminology just a bit to make
it friendlier... and slipped up. Anchors added, thanks to
all who pointed it out.

F
o
x
t
r
o
t
U
n
i
f
o
r
m
Found a typo in this node? /msg me
The hell with paco, vote for Erudil!
Re: Regex refresher
by blakem (Monsignor) on Sep 12, 2002 at 07:23 UTC

Heh, looks like it will match every string in the world to me. Were there supposed to be anchors on the ends?
Update:
With anchors, it will match binary numbers divisible by 3. though I haven't figured out exactly why.
Update:
Here's a couple hints though. Left shifting a number divisible by 3 yields another number that is divisible by three. Adding two such numbers (obviously) still preserves this property. The innermost part of the regex is still a stumper though...
What kind of bits match /^(01*0)*$/, and why can you shove them between two '1's and get a number divisible by 3???
Blake
 [reply] [d/l] 

With anchors, it will match binary numbers divisible by 3. though I haven't figured out exactly why...
I think you are right, assuming the empty string is divisble by 3.
Here is why:
To start at the front of the regex, /^0*$/ will match a number made up of zero or more zeros. Assuming the empty string qualifies as being divisible by 3, all matching strings are divisble by 3: they all have the value 0.
Putting zeroes at the back of the number just results in numbers that are twice as large and thus divisible by 3, if the original number was.
Going to the other side of the alternation, and simplifying we see /^(11)*$/, that is: a string consisting of zero or more pairs of adjacent 1bits. It is easy to show that the value produced by every pair is divisible by 3: 2^{n+1} + 2^{n} = 3 * 2^{n}.
We've already seen that zero repetitions of the inner group of /^(1(01*0)*1)*$/ result in a number divisible by 3. Mixing substrings of two zeroes into a binary number will not alter its divisibilty by 3. The cases of adding to the front and back are obvious, but what happens when they are put in between? In that case the 1 representing 2^{n} gets to represent 2^{n+2}, which will just add 3 * 2^{n} to the sum of the digits.
What happens when we start using (and repeating) the 1 from the group in /1(01*0)1/? For every 1, you shift the leftmost 1 one position to the left, doubling its value and obtain a 1 representing half the value of the leftmost 1 before shifting. This adds 2^{n} + 2^{n1} = 3 * 2^{n1}. So the number is still divisible by 3.
I think that covers all variations... I could have dreamt typing in some of them, though. ;)
— Arien
 [reply] [d/l] [select] 

++ Very well explained. You've convinced me that this regex will match numbers divisible by three. How about the converse though? Can you show that if a number is divisible by three it will be matched by this regex? For example, in base10 /8$/ will only match even numbers, but it wont match all of them.
Blake
 [reply] [d/l] 
Re: Regex refresher
by abell (Chaplain) on Sep 12, 2002 at 07:20 UTC

I believe you should put a ^ and $ to start and end the regex, which otherwise matches everything.
Since the search space is pretty small, one can answer the shortest ten question with a brute force search, observing that only binary strings match:
#!/usr/bin/perl w
use strict;
my @strings = ('');
for my $a (1..10) {
for my $b (0..(2**$a1)) {
push @strings, sprintf ("%0".$a."b", $b);
}
}
print ">", join("<\n>", grep /^(01(01*0)*1)*$/, @strings);
That spits out
><
>0<
>00<
>11<
>000<
>011<
>110<
>0000<
>0011<
>0110<
>1001<
>1100<
>1111<
>00000<
>00011<
>00110<
>01001<
>01100<
...
Best regards
Antonio Bellezza  [reply] [d/l] 
Re: Regex refresher
by Nemp (Pilgrim) on Sep 12, 2002 at 11:38 UTC

Hello!
Cool thread :) ... but can somebody help me out please? I think my understanding/knowledge of regex's is sorely lacking because I'm not able to follow the descriptions in the answers above! To explain my problem...
First of all, the original answers I came up with on paper before looking at anyone else's replies were...
"", 01, 11, 0001, 1001, 00101, 10101, 000001, 100001, 001101
... which I'm assuming are the incorrect 10 from looking at the posts above and the amount of experience the posters share compared to me! :)
I wrote a little program which spits out the first answers in agreement with abell's answer above, but not in agreement with dws's.
I think my stumbling block is understanding how the regex /(01(01*0)*1)*/ matches the 00? Could somebody explain this to me? I've read a lot of the references but in my mind I'm looking at it like this...
<Incorrect, Please Explain Why?>
1) I take the 01 and figure that (except for the empty string) it will match any string that begins 0 or 1.
2) I take the (01*0)* and figure that it will match nothing, 00 or 0x0 where x is an arbitrary number of 1's.
3) I assume the last 1 means there has to be a 1 at the end of the regex?! I know this is wrong from the answers I'm getting back but I don't know why! argh!
</Incorrect, Please Explain Why?>
Scratching his head and hoping for help, Neil
UPDATE: I think zakb has hit the nail on the head for me with his answer :) ... Wow! Do I feel silly now ;) I *think* if you add the brackets that I was interpreting around the 01 combination it makes the answers I came up with correct :) Of course it doesn't help you spot the pattern that no longer exists ;)  [reply] [d/l] [select] 

Unless I'm hopelessly wrong, for starters I think you have step one wrong.
I think step one should be something like match either 0 or 1(01*0)*1. The final asterisk matches zero or more of one of the matched branch or atom so that's how it matches 00.
If this was your stumbling block, hopefully I've helped!
 [reply] [d/l] [select] 

I think I'm missing something as well. My logic follows yours, and I got *almost* the same answers (in my head, anyway) as dws'. Ours differ firstly because he repeated 1111 ;), but more importantly on the number 101. I can't see that number matching, as 1(01*0)*1 as far as I can tell would match 1, then choosing to continue into the middle group, it would at the very least match 00, not? The only quantifier is on the 1, so as I see it, 00 needs to be matched, then the final 1, so that choosing to include the middle group would yield at minimum 1001. Thusly, I came up with:
""
0
00
11
000
0000
1001
1111
00000
10101
Yet, I know that's not right (in hindsight). After looking over some other answers, I tried out 0011, 1100 and 0110, which all work out in practice, yet I can't figure out why. Is there some precedence issue that I'm missing? Eh..I've got some time to figure it out though, I'm only a 3rd year undergrad ;)  [reply] [d/l] [select] 


Re: Regex refresher
by dws (Chancellor) on Sep 12, 2002 at 07:20 UTC

Spend a few minutes figuring out, for instance, the ten shortest strings that it'll match
Here are the first 12
""
"0"
"00" "11"
"000" "101"
"0000" "1111" "1001" "1111"
"00000" "10101"
Bonus points if you notice a pattern in the set of strings that it matches....
All the strings are reversable?
(I'm assuming that anchors were implied.)
 [reply] [d/l] 

""
"0"
"00" "11"
"000" "101"
"0000" "1111" "1001" "1111"
"00000" "10101"
I regret to inform you that 101 is not matched, though
110 and 011 are. (abell's solution is correct.)
I'm a bit gratified that people seem to be having a
fairly difficult time with this (must be the
(01*0)* group in the middle of the righthand
alternative; I think people tend to see the 01 as both bound
by the *, rather than just the 1): means it's a useful
example.
Oh, and the completeness proof for this regex matching
multiples of three is fairly straightforward, once you have
a minimal DFA for matching this regex. (In short: your
minimal DFA has three states, each corresponding to a
modulus of 3. 0 is the only accepting state, and you can
prove by contradiction that all multiples of 3 end up in
the 0 state.) Maybe I'll write it up when my profs stop
assigning me papers. %)

F
o
x
t
r
o
t
U
n
i
f
o
r
m
Found a typo in this node? /msg me
The hell with paco, vote for Erudil!
 [reply] [d/l] [select] 
Re: Regex refresher
by demerphq (Chancellor) on Sep 12, 2002 at 13:59 UTC

Here the first 13 I think. But I dont see the pattern. What is it?
1 : '' # empty string.
2 : 0
3 : 00
4 : 11
5 : 000
6 : 011
7 : 110
8 : 0000
9 : 0011
10: 0110
11: 1001
12: 1100
13: 1111
 demerphq
my friends call me, usually because I'm late....
 [reply] [d/l] 
Re: Regex refresher
by PodMaster (Abbot) on Sep 12, 2002 at 14:30 UTC

When in pause, it means you don't quite understand what the pattern is saying, so it's best to get it explained
 [reply] [d/l] 
Re: Regex refresher
by bart (Canon) on Sep 13, 2002 at 13:06 UTC

Bonus points if you notice a pattern in the set of strings that it matches....
Well, it smells a bit like a Huffman encoding.
 [reply] 
Re: Regex refresher
by brianarn (Chaplain) on Sep 13, 2002 at 16:27 UTC

To be honest, I'm just horrible at anything slightly complex with a regex  but after reading the peoples' sets of the first matching strings, the pattern I see looks like it's all of even parity (even number of 1 bits). That's what I see in it, and am now trying to wrap my brain around the regex.
~Brian  [reply] 

