in reply to Analysis of Regular Expressions

Three possibilities for creating a regex generality rating come to mind:

  1. generate all strings of a given length L and count how many are matched by the regex
  2. use the regex to calculate the number of all possible strings (up to a given length?!)
  3. do some inexact estimation

1. would be easy to implement (generate all strings, match) - but it is probably much too slow: counting just the letters and digits already gives you 26+26+10 = 62 different characters at each position of your string ...
... so this approach is not feasible :-(
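To make the infeasibility concrete, here is a minimal Perl sketch of approach 1; the alphabet, regex, and length are toy values I picked so the run finishes instantly, since the search space grows as alphabet size to the power L:

```perl
use strict;
use warnings;

# Approach 1 as a sketch: enumerate every string of length $len over a
# small alphabet and count how many the regex matches. With the full
# 62-character alphanumeric set this is 62**$len strings - hopeless for
# anything but tiny lengths, which is the point made above.
sub count_matches {
    my ($re, $alphabet, $len) = @_;
    my $anchored = qr/^(?:$re)\z/;
    my @strings  = ('');
    for (1 .. $len) {
        @strings = map { my $s = $_; map { $s . $_ } @$alphabet } @strings;
    }
    return scalar grep { $_ =~ $anchored } @strings;
}

# e.g. 2-character strings over a..c matched by /a./: aa, ab, ac
my $n = count_matches('a.', ['a' .. 'c'], 2);   # 3
```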

2. is trickier - you need to analyze the regex, but this is possible (see the previous answers, or here for some tools for this).

I would then calculate two ratings for this and combine them in a suitable way (whatever "suitable" looks like ;-)

However I see some problems with this approach:

3. If I had the task you describe, I'd first check my possible input regexes (maybe most of them are just simple?) and check whether it actually causes a problem if some are not ordered correctly. It is probably possible to create a simplified formula that works well in most cases ...

HTH, Rata (very interested in how you sort this problem out!)

Replies are listed 'Best First'.
Re^2: Analysis of Regular Expressions
by JavaFan (Canon) on Mar 17, 2010 at 10:59 UTC
    1. would be easy to implement (generate all strings, match) - but is probably much too slow (just counting the letters and numbers would give you 26+26+10 different characters at each position of your string ...

    Only 62 letters and numbers? That's so limiting.... That doesn't even cover most European languages. Currently, Perl has thousands of characters that are considered letters and digits.

      Or any of the official North American languages. It doesn't cover English (naïve), French (fête) or Spanish (señor).
Re^2: Analysis of Regular Expressions
by PetaMem (Priest) on Mar 17, 2010 at 17:16 UTC

    Rata,

    ok, it's clear that solution 1) has exponential runtime and space requirements - the size of the charset doesn't even matter (well, until JavaFan points out that a charset of 1 char would ...)

    I basically went for solution 2) before, but only with some half-baked heuristic constructs, so a more high-level description of the problem is probably in order: I'm working on a chatbot demo which has to catch some phrases.

    That's simple pattern matching just before any parsing, and it delivers decent speed. The whole point of this discussion and my endeavour is that the set of regular expressions keeps being extended, and it is not feasible (at least I haven't come up with a feasible way) to sort them by "do-what-I-mean" manually.

    Doing a trivial "sort by length" has actually worked very well for a set of up to 500-600 rules. But now I am starting to get false matches ONLY because some "more generic" regular expressions are handled before the more specific ones.

    Under the assumption that (a|b|c) is more generic than (a), it is clear that the simple "by length" ordering will produce the wrong order, because I want to process the "specific" rules first and the "generic" ones last.

    What has helped:

    • removing named captures from the length consideration, e.g. making (?<name>...) 9 chars shorter
    • counting an alternation like (a|b|c) as having length 1, same for character classes
    • actually subtracting a length of 1 for every wildcard.
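    Those three adjustments can be sketched as a scoring function. The substitutions below are my rough reading of the bullets above; a real implementation would need a proper regex parser rather than regexes-on-regexes (nested groups will fool it):

```perl
use strict;
use warnings;

# Adjusted-length heuristic, per the three bullets above (a sketch,
# not a robust regex parser).
sub adjusted_length {
    my ($re) = @_;
    my $r = $re;
    $r =~ s/\(\?<[^>]+>/(/g;                 # strip named-capture syntax
    $r =~ s/\[[^\]]*\]/x/g;                  # character class counts as 1 char
    $r =~ s/\(([^()|]+\|)+[^()|]+\)/x/g;     # flat alternation counts as 1 char
    my $len       = length $r;
    my $wildcards = () = $r =~ /(?<!\\)\./g; # unescaped dots
    return $len - $wildcards;                # subtract 1 per wildcard
}

# Most specific (highest score) first:
my @rules   = ('(a|b|c)', 'hello\s+world', '(?<greet>hi)');
my @ordered = sort { adjusted_length($b) <=> adjusted_length($a) } @rules;
```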

    Maybe the best solution would be to give every rule an identifier (which has happened anyway) and to add some information like "try this before ruleX". Which is starting to look like a parser. :-)
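    The identifier idea doesn't have to become a full parser right away: a plain "must precede" table plus a topological sort already covers it. The rule names below are invented purely for illustration:

```perl
use strict;
use warnings;

# "try this before ruleX": each rule may list rules it must precede.
# A depth-first topological sort then yields a valid evaluation order.
my %before = (
    greet_formal => ['greet_any'],   # specific greeting before generic one
    bye_formal   => ['bye_any'],
);
my @rules = qw(greet_any greet_formal bye_any bye_formal);

my (%seen, @post);
sub visit {
    my ($rule) = @_;
    return if $seen{$rule}++;
    visit($_) for @{ $before{$rule} || [] };
    push @post, $rule;               # successors land before their rule
}
visit($_) for @rules;
my @ordered = reverse @post;         # now every rule precedes its successors
```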

    Bye
     PetaMem
        All Perl:   MT, NLP, NLU

      Hi PetaMem!

      Thanks for that additional information! Chatbots are a really interesting topic!

      Looking at your specific problem (and not the general generality of regexes) makes me wonder whether the empirical approach (checking how many possible strings are matched by the regex) really doesn't work - with some modifications, however:

      • Your problem doesn't seem to be real-time - and the order of your rules seems to be static:
        therefore you could do the generality rating in an offline preprocessing phase, e.g. by letting your computer work on it all night/weekend/holiday
      • Instead of looking at all possible input strings, narrow it down to the expected data. You probably have tons of chat logs which you can use.
        Your rating could then be: how many matches a regex finds in a given set of logs
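      As a sketch of that log-based rating (the rules and log lines below are made up): count each regex's matches over the corpus offline, then order the rules from fewest hits (most specific) to most hits (most generic):

```perl
use strict;
use warnings;

# Offline generality rating against a chat-log corpus: a regex that
# matches many log lines is "generic", one that matches few is
# "specific", so sort ascending by hit count.
my @rules = ('^hello\b', 'h', '\bnow\b');
my @log   = ('hello there', 'hi', 'goodbye now');

my %hits;
for my $re (@rules) {
    my $compiled = qr/$re/;
    $hits{$re} = grep { $_ =~ $compiled } @log;
}

my @ordered = sort { $hits{$a} <=> $hits{$b} } @rules;   # specific first
```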

      Regarding the high number of possible characters (as also mentioned by JavaFan): here you could do some preprocessing, e.g. by replacing the German a-umlaut (ä) with ae. This will also help with some other "languages" like leetspeak ... your chatbot seems to get confused when greeted with a friendly "h3110" ;-)
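      A sketch of that preprocessing step - the umlaut and leetspeak mappings below are deliberately tiny and illustrative, not a complete table:

```perl
use strict;
use warnings;
use utf8;

# Normalize input before matching: fold German umlauts to ASCII
# digraphs and undo a few common leetspeak digit substitutions.
sub normalize {
    my ($text) = @_;
    $text =~ s/ä/ae/g;
    $text =~ s/ö/oe/g;
    $text =~ s/ü/ue/g;
    $text =~ s/ß/ss/g;
    $text =~ tr/43105/aelos/;   # 4->a, 3->e, 1->l, 0->o, 5->s
    return lc $text;
}

my $greeting = normalize('h3110');   # "hello"
```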

      However, I fear that any automatic ordering would just be one criterion for determining the order of the rules. You will probably also add some rating done by a human. And the idea of giving additional context to a rule (like you wrote: try this before ruleX ...) sounds great to me. Have you also experimented with randomness? (Using a random order when several rules have a similar rating.) That way the answers might not be so predictable ...

      HTH, Rata

Re^2: Analysis of Regular Expressions
by LanX (Saint) on Mar 17, 2010 at 13:47 UTC
    > 2. use the regex to calculate the number of all possible strings (up to a given length?!)

    you know it's perfectly possible to construct valid regexes which won't terminate before all suns have burned out...

    So your empirical approach is only practicable by limiting the regexes to a reasonable subset.

    One conceivable way would be to limit yourself to regexes that are pure state machines and to explore the feasible transitions between states.
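    As a sketch of that restriction: once a regex is reduced to a DFA, the number of accepted strings up to a given length falls out of a small dynamic program over the transition table - no enumeration needed. The two-state machine below (strings over {a,b} ending in b) is hand-made, not derived from a real regex:

```perl
use strict;
use warnings;

# Count strings of length 1..$max_len accepted by a DFA by propagating
# "number of ways to reach each state" one character at a time.
my %delta = (
    S0 => { a => 'S0', b => 'S1' },
    S1 => { a => 'S0', b => 'S1' },
);
my %accepting = (S1 => 1);

sub count_accepted {
    my ($max_len) = @_;
    my %ways  = (S0 => 1);     # one way to be in the start state
    my $total = 0;
    for (1 .. $max_len) {
        my %next;
        for my $state (keys %ways) {
            for my $char (keys %{ $delta{$state} }) {
                $next{ $delta{$state}{$char} } += $ways{$state};
            }
        }
        %ways = %next;
        $total += $ways{$_} || 0 for keys %accepting;
    }
    return $total;
}

my $n = count_accepted(3);   # 1 + 2 + 4 = 7 strings of length <= 3 end in 'b'
```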

    But this is highly speculative without a detailed question from the OP; IMHO it's a case of XY!

    Cheers Rolf