in reply to Analysis of Regular Expressions

So in other words, a regular expression is more specific than another, if it "matches less" than that other rx.

Well, apart from the fact that this question isn't even well formulated (for instance, all the given regexes, except the first one, match an countable, infinite set of strings - and you cannot order such sets on size), I doubt the question is even computable. I don't think even if the simpler question "given two (Perl) regexes, determine whether they match the same set of strings" is computable.

Oh - yes and the computation of the "genericity/specifity" should be fast.
Hahahahaha. It's a little early for April 1.

Replies are listed 'Best First'.
Re^2: Analysis of Regular Expressions
by LanX (Saint) on Mar 17, 2010 at 11:43 UTC
    > > Oh - yes and the computation of the "genericity/specifity" should be fast.

    > Hahahahaha. It's a little early for April 1.

    LOL, JavaFan++ =)

    Cheers Rolf

      Uh.. boys... (JavaFan and LanX) gimme a break...

      I mean yes, I got my master in CS 13 years ago and there's no need to remind me of Gödel, the halting problem and similar.

      And YES, you are theoretically right. And NO, you are practically (especially JavaFan) way off. There are no countable infinite sets on finite computers, and if you have a FINITE SET of input strings, the cardinality of the sets that represent the results of the presented regexs may very well differ and thus be sortable. So your ivory tower arguments are mostly for the dustbin...

      I know, that for a Java Fan it might be very hard to adopt the Perl-inherent DWIM scheme when interpreting questions, but why LanX jumps on that bandwaggon is beyond me.

      Fotunately I got already some interesting hints in this thread and am working on a SOLUTION. Yes - will be presented for discussion.

      Bye
       PetaMem
          All Perl:   MT, NLP, NLU

        > I know, that for a Java Fan it might be very hard to adopt the Perl-inherent DWIM scheme when interpreting questions, but why LanX jumps on that bandwaggon is beyond me.

        Come on, keep cool! I just can't resist a good pun! =)

        Just think of me as too ignorant as to fully understand your question... ;-)

        > and if you have a FINITE SET of input strings, the cardinality of the sets that represent the results of the presented regexs may very well differ and thus be sortable. So your ivory tower arguments are mostly for the dustbin...

        true, but you only limited the sets of input strings and not the sets of regexes or allowed regex commands.

        IIRC Friedl (or perldocs?) describes in his book a regex with only two nested parens and quantifiers which would recursively backtrack for years!

        Without a detailed description we have to answer the question in generality, and fast solutions are not really likely!

        Cheers Rolf

        There are no countable infinite sets on finite computers, and if you have a FINITE SET of input strings, the cardinality of the sets that represent the results of the presented regexs may very well differ and thus be sortable.
        True.

        Let's do some arithmetic. Let's take a computer with 1 kB of memory - not unreasonable in 2010. For argument sakes, ignore any other memory, including disk space.

        Limiting the problem space to all strings that can be stored on disk, you'd have 2561024 finite strings to work with. 2561024 is a finite number. Let's assume that each yoctosecond (10-24s), you can analyse yotta-strings (1024 strings). In other words, each second, you'd be able to determine of 1048 ≅ 2160 strings whether they match the given regexes. Then you still need about 102418 seconds to examine them all. The age of the universe is estimated to be about 13.75 billion years, which is about 4.4 * 1017 seconds.

        So, yeah, there are no countable infinite sets on computers. But I don't think it's really practical to do a brute-force calculation of all the possible FINITE strings a computer can hold.

        OK...

        lets go explicit, the following (halting!) counterexamples use fairly short input strings and fairly simple regexes, nevertheless resulting in exploding run times...

        perl -e ' for $n(10..14) { $str = "x"x($n*2+1); $t=time; $str =~ /(x+x+){$n}y/; print "\$str:$str n:$n time:",time-$t,"sec\n" }' $str:xxxxxxxxxxxxxxxxxxxxx n:10 time:0sec $str:xxxxxxxxxxxxxxxxxxxxxxx n:11 time:3sec $str:xxxxxxxxxxxxxxxxxxxxxxxxx n:12 time:9sec $str:xxxxxxxxxxxxxxxxxxxxxxxxxxx n:13 time:39sec $str:xxxxxxxxxxxxxxxxxxxxxxxxxxxxx n:14 time:156sec

        I hope it's now obvious that an empiric brute force approach can only (if ever) work with rigid restrictions!

        (credits go to ratazong for finding the base principle in this example after some /msging and credits to his unwillingness to post it afterwards ... ;)

        Cheers Rolf

        UPDATE: improved code...

        UPDATE just for fun :)

        perl -e ' for $n(1..14) { $str = "x"x($n*3+1); $t=time; $str =~ /(x+x+x+){$n}y/; print "\$str:$str n:$n time:",time-$t,"sec\n" }' $str:xxxx n:1 time:0sec $str:xxxxxxx n:2 time:0sec $str:xxxxxxxxxx n:3 time:0sec $str:xxxxxxxxxxxxx n:4 time:0sec $str:xxxxxxxxxxxxxxxx n:5 time:0sec $str:xxxxxxxxxxxxxxxxxxx n:6 time:0sec $str:xxxxxxxxxxxxxxxxxxxxxx n:7 time:1sec $str:xxxxxxxxxxxxxxxxxxxxxxxxx n:8 time:9sec $str:xxxxxxxxxxxxxxxxxxxxxxxxxxxx n:9 time:73sec $str:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx n:10 time:555sec ^C