http://qs1969.pair.com?node_id=197155

Monks,

I came across this regex in my compilers course. It isn't particularly advanced, in terms of Perl's regex capabilities, but it's complex enough to give pause to most of the students in this (fourth-year undergraduate) course. Spend a few minutes figuring out, for instance, the ten shortest strings that it'll match:

/^(0|1(01*0)*1)*$/

Bonus points if you notice a pattern in the set of strings that it matches....

Update: Silly me. In the context of this course, regexes describe languages, they don't match strings. In that context, anchors aren't useful. (Finite Automata match strings, and you can build FAs from regexes automatically.) But I brought this example over to a Perl context, and played with the terminology just a bit to make it friendlier... and slipped up. Anchors added, thanks to all who pointed it out.

--
F o x t r o t U n i f o r m
Found a typo in this node? /msg me
The hell with paco, vote for Erudil!

Replies are listed 'Best First'.
Re: Regex refresher
by blakem (Monsignor) on Sep 12, 2002 at 07:23 UTC
    Heh, looks like it will match every string in the world to me. Were there supposed to be anchors on the ends?

    Update: With anchors, it will match binary numbers divisible by 3. though I haven't figured out exactly why.

    Update: Here's a couple hints though. Left shifting a number divisible by 3 yields another number that is divisible by three. Adding two such numbers (obviously) still preserves this property. The innermost part of the regex is still a stumper though...

    What kind of bits match /^(01*0)*$/, and why can you shove them between two '1's and get a number divisible by 3???

    -Blake

      With anchors, it will match binary numbers divisible by 3. though I haven't figured out exactly why...

      I think you are right, assuming the empty string is divisble by 3.

      Here is why:

      To start at the front of the regex, /^0*$/ will match a number made up of zero or more zeros. Assuming the empty string qualifies as being divisible by 3, all matching strings are divisble by 3: they all have the value 0.

      Putting zeroes at the back of the number just results in numbers that are twice as large and thus divisible by 3, if the original number was.

      Going to the other side of the alternation, and simplifying we see /^(11)*$/, that is: a string consisting of zero or more pairs of adjacent 1-bits. It is easy to show that the value produced by every pair is divisible by 3: 2n+1 + 2n = 3 * 2n.

      We've already seen that zero repetitions of the inner group of /^(1(01*0)*1)*$/ result in a number divisible by 3. Mixing substrings of two zeroes into a binary number will not alter its divisibilty by 3. The cases of adding to the front and back are obvious, but what happens when they are put in between? In that case the 1 representing 2n gets to represent 2n+2, which will just add 3 * 2n to the sum of the digits.

      What happens when we start using (and repeating) the 1 from the group in /1(01*0)1/? For every 1, you shift the left-most 1 one position to the left, doubling its value and obtain a 1 representing half the value of the left-most 1 before shifting. This adds 2n + 2n-1 = 3 * 2n-1. So the number is still divisible by 3.

      I think that covers all variations... I could have dreamt typing in some of them, though. ;-)

      — Arien

        ++ Very well explained. You've convinced me that this regex will match numbers divisible by three. How about the converse though? Can you show that if a number is divisible by three it will be matched by this regex? For example, in base10 /8$/ will only match even numbers, but it wont match all of them.

        -Blake

Re: Regex refresher
by abell (Chaplain) on Sep 12, 2002 at 07:20 UTC
    I believe you should put a ^ and $ to start and end the regex, which otherwise matches everything.
    Since the search space is pretty small, one can answer the shortest ten question with a brute force search, observing that only binary strings match:
    #!/usr/bin/perl -w use strict; my @strings = (''); for my $a (1..10) { for my $b (0..(2**$a-1)) { push @strings, sprintf ("%0".$a."b", $b); } } print "->", join("<-\n->", grep /^(0|1(01*0)*1)*$/, @strings);
    That spits out
    -><-
    ->0<-
    ->00<-
    ->11<-
    ->000<-
    ->011<-
    ->110<-
    ->0000<-
    ->0011<-
    ->0110<-
    ->1001<-
    ->1100<-
    ->1111<-
    ->00000<-
    ->00011<-
    ->00110<-
    ->01001<-
    ->01100<-
    ...
    


    Best regards

    Antonio Bellezza
Re: Regex refresher
by Nemp (Pilgrim) on Sep 12, 2002 at 11:38 UTC
    Hello!

    Cool thread :) ... but can somebody help me out please? I think my understanding/knowledge of regex's is sorely lacking because I'm not able to follow the descriptions in the answers above! To explain my problem...

    First of all, the original answers I came up with on paper before looking at anyone else's replies were...
    "", 01, 11, 0001, 1001, 00101, 10101, 000001, 100001, 001101
    ... which I'm assuming are the incorrect 10 from looking at the posts above and the amount of experience the posters share compared to me! :)

    I wrote a little program which spits out the first answers in agreement with abell's answer above, but not in agreement with dws's.

    I think my stumbling block is understanding how the regex /(0|1(01*0)*1)*/ matches the 00? Could somebody explain this to me? I've read a lot of the references but in my mind I'm looking at it like this...

    <Incorrect, Please Explain Why?>
    1) I take the 0|1 and figure that (except for the empty string) it will match any string that begins 0 or 1.
    2) I take the (01*0)* and figure that it will match nothing, 00 or 0x0 where x is an arbitrary number of 1's.
    3) I assume the last 1 means there has to be a 1 at the end of the regex?! I know this is wrong from the answers I'm getting back but I don't know why! argh!
    </Incorrect, Please Explain Why?>

    Scratching his head and hoping for help,
    Neil

    UPDATE: I think zakb has hit the nail on the head for me with his answer :) ... Wow! Do I feel silly now ;)
    I *think* if you add the brackets that I was interpreting around the 0|1 combination it makes the answers I came up with correct :) Of course it doesn't help you spot the pattern that no longer exists ;)

      Unless I'm hopelessly wrong, for starters I think you have step one wrong.

      I think step one should be something like match either 0 or 1(01*0)*1. The final asterisk matches zero or more of one of the matched branch or atom so that's how it matches 00.

      If this was your stumbling block, hopefully I've helped!

        I think I'm missing something as well. My logic follows yours, and I got *almost* the same answers (in my head, anyway) as dws'. Ours differ firstly because he repeated 1111 ;), but more importantly on the number 101. I can't see that number matching, as 1(01*0)*1 as far as I can tell would match 1, then choosing to continue into the middle group, it would at the very least match 00, not? The only quantifier is on the 1, so as I see it, 00 needs to be matched, then the final 1, so that choosing to include the middle group would yield at minimum 1001. Thusly, I came up with:
        "" 0 00 11 000 0000 1001 1111 00000 10101
        Yet, I know that's not right (in hindsight). After looking over some other answers, I tried out 0011, 1100 and 0110, which all work out in practice, yet I can't figure out why. Is there some precedence issue that I'm missing?
        Eh..I've got some time to figure it out though, I'm only a 3rd year undergrad ;)
Re: Regex refresher
by dws (Chancellor) on Sep 12, 2002 at 07:20 UTC
    Spend a few minutes figuring out, for instance, the ten shortest strings that it'll match

    Here are the first 12

    "" "0" "00" "11" "000" "101" "0000" "1111" "1001" "1111" "00000" "10101"

    Bonus points if you notice a pattern in the set of strings that it matches....

    All the strings are reversable?

    (I'm assuming that anchors were implied.)

        "" "0" "00" "11" "000" "101" "0000" "1111" "1001" "1111" "00000" "10101"

      I regret to inform you that 101 is not matched, though 110 and 011 are. (abell's solution is correct.)

      I'm a bit gratified that people seem to be having a fairly difficult time with this (must be the (01*0)* group in the middle of the right-hand alternative; I think people tend to see the 01 as both bound by the *, rather than just the 1): means it's a useful example.

      Oh, and the completeness proof for this regex matching multiples of three is fairly straightforward, once you have a minimal DFA for matching this regex. (In short: your minimal DFA has three states, each corresponding to a modulus of 3. 0 is the only accepting state, and you can prove by contradiction that all multiples of 3 end up in the 0 state.) Maybe I'll write it up when my profs stop assigning me papers. %-)

      --
      F o x t r o t U n i f o r m
      Found a typo in this node? /msg me
      The hell with paco, vote for Erudil!

Re: Regex refresher
by demerphq (Chancellor) on Sep 12, 2002 at 13:59 UTC
    Here the first 13 I think. But I dont see the pattern. What is it?
    1 : '' # empty string. 2 : 0 3 : 00 4 : 11 5 : 000 6 : 011 7 : 110 8 : 0000 9 : 0011 10: 0110 11: 1001 12: 1100 13: 1111

    --- demerphq
    my friends call me, usually because I'm late....

Re: Regex refresher
by PodMaster (Abbot) on Sep 12, 2002 at 14:30 UTC
    When in pause, it means you don't quite understand what the pattern is saying, so it's best to get it explained
Re: Regex refresher
by brianarn (Chaplain) on Sep 13, 2002 at 16:27 UTC
    To be honest, I'm just horrible at anything slightly complex with a regex - but after reading the peoples' sets of the first matching strings, the pattern I see looks like it's all of even parity (even number of 1 bits). That's what I see in it, and am now trying to wrap my brain around the regex.

    ~Brian
Re: Regex refresher
by bart (Canon) on Sep 13, 2002 at 13:06 UTC
    Bonus points if you notice a pattern in the set of strings that it matches....
    Well, it smells a bit like a Huffman encoding.