in reply to generating regexes?

This sounds like a good problem domain for genetic algorithms, which I unfortunately don't know much about. (The machine learning course I took at uni was supposed to get into GAs, but of course we ran out of time....)

Here's the basic theory: (for more info, look at geneticprogramming.com)

  1. Generate a population of possible solutions pretty much at random.
  2. Run some sort of fitness test on the solutions (in this case, try matching them against your data, and see how many matches you get, and how close those matches are).
  3. Generate a new population: copy the best solutions over verbatim, mutate some of the solutions, and "breed" (cross-over) some of the solutions.
  4. Repeat until you get a "close enough" solution.
gumpu has done some genetic programming in Perl before.
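
The four steps above can be sketched roughly as follows. This is only an illustration in Python rather than Perl, and everything in it is an assumption made for the example: the token list, the population size, the mutation rate, and the crude fitness test (how many examples the candidate matches in full).

```python
import random
import re

# Hypothetical building blocks for candidate regexes (an assumption of this sketch).
TOKENS = [r"\d", r"\w", "[a-z]", "[A-Z]", ".", "-", "+", "*", "?"]
MAX_LEN = 8

def random_genome(rng):
    """Step 1: a candidate is a random, non-empty list of tokens."""
    return [rng.choice(TOKENS) for _ in range(rng.randint(1, MAX_LEN))]

def fitness(genome, examples):
    """Step 2: count the examples matched in full; invalid regexes score 0."""
    try:
        rx = re.compile("".join(genome))
    except re.error:
        return 0
    return sum(1 for s in examples if rx.fullmatch(s))

def mutate(genome, rng):
    """Replace one randomly chosen token."""
    child = list(genome)
    child[rng.randrange(len(child))] = rng.choice(TOKENS)
    return child

def crossover(a, b, rng):
    """One-point cross-over: a prefix of a joined to a suffix of b."""
    child = a[: rng.randint(0, len(a))] + b[rng.randint(0, len(b)):]
    return child[:MAX_LEN] or list(a)

def evolve(examples, generations=50, size=40, seed=0):
    rng = random.Random(seed)
    pop = [random_genome(rng) for _ in range(size)]
    for _ in range(generations):
        pop.sort(key=lambda g: fitness(g, examples), reverse=True)
        if fitness(pop[0], examples) == len(examples):
            break  # Step 4: "close enough"
        elite = pop[: size // 4]  # Step 3: copy the best over verbatim
        children = []
        while len(elite) + len(children) < size:
            child = crossover(rng.choice(elite), rng.choice(elite), rng)
            if rng.random() < 0.3:
                child = mutate(child, rng)
            children.append(child)
        pop = elite + children
    return max(pop, key=lambda g: fitness(g, examples))

best = evolve(["123", "4567", "89"])
print("".join(best))
```

With a fitness test this crude, a pattern such as .* scores perfectly, so the scoring function is where the real work would be.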

It strikes me that, since Perl's regexes are built around a backtracking finite automaton, it might be possible to analytically compute a regex from "representative" data, using either some sort of search or constraint-satisfaction techniques. It also seems plausible that you'd be able to use Markov chains to describe the data: I've seen this technique work fairly well at finding potential coding regions in DNA, which looks like a similar problem (looking for patterns in connected data), and since Markov processes are pretty close to state machines, they might mesh well with the regex engine....
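
To make the Markov chain idea a little more concrete, here is a minimal sketch in Python of a first-order, character-level model. It is not taken from any of the DNA work mentioned above; the sentinel markers, the add-one smoothing, and the vocab_size parameter are all assumptions chosen for illustration. It only scores how well a string fits the training examples, and it does not turn the model into a regex.

```python
import math
from collections import defaultdict

# Hypothetical sentinels marking the start and end of a string.
START, END = "<s>", "</s>"

def train(examples):
    """Count character-to-character transitions across the examples."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in examples:
        chars = [START] + list(s) + [END]
        for a, b in zip(chars, chars[1:]):
            counts[a][b] += 1
    return counts

def log_likelihood(counts, s, vocab_size=128):
    """Log-probability of s under the model, with add-one smoothing."""
    chars = [START] + list(s) + [END]
    total = 0.0
    for a, b in zip(chars, chars[1:]):
        row = counts.get(a, {})
        total += math.log((row.get(b, 0) + 1) / (sum(row.values()) + vocab_size))
    return total

model = train(["AB-12", "CD-34", "EF-56"])
print(log_likelihood(model, "AB-34"), log_likelihood(model, "zzzzz"))
```

Strings built from transitions seen in the training data get a higher (less negative) score than strings built from unseen ones.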

Great problem! Thanks for bringing this up. If I have time today, I'll hunt down some useful-looking papers and update this node.

--
:wq

Replies are listed 'Best First'.
Re: (FoxUni) Re: generating regexes?
by atlantageek (Monk) on Nov 20, 2001 at 01:43 UTC
    The only problem with the Genetic Algorithm approach is the scoring function. There are two possible types of bad answers.
    1. RE does not match at all
    2. RE matches too much (e.g. /^.*$/)

    You need a function that can tell you whether answer a is closer than answer b when both are type 1 answers, or whether a matches a smaller set of strings than b when both are type 2 answers. It seems that you would have to have access to the actual RE code in perl to write such a function.
    ----
    I always wanted to be somebody... I guess I should have been more specific.
      This was actually the direction the mailing-list discussion took. The final suggestion was that you'd need two sets of data: one set that should match, and one set that shouldn't match. The scoring function would be a combination of correctly matching those that should match and correctly *not* matching those that shouldn't.
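
A two-set scoring function along those lines might look something like this. It is a sketch in Python rather than Perl, and the equal weighting of the two sets is an assumption; the mailing-list discussion did not specify how to combine them.

```python
import re

def score(pattern, should_match, should_not_match):
    """Fraction of strings classified correctly across both sets."""
    total = len(should_match) + len(should_not_match)
    if total == 0:
        raise ValueError("need at least one example")
    rx = re.compile(pattern)
    hits = sum(1 for s in should_match if rx.search(s))
    rejections = sum(1 for s in should_not_match if not rx.search(s))
    return (hits + rejections) / total

positives = ["123", "42"]
negatives = ["abc", "12a"]
print(score(r"^\d+$", positives, negatives))
print(score(r"^.*$", positives, negatives))
```

An overly general pattern such as ^.*$ gets every positive right and every negative wrong, so it can no longer reach a perfect score.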

      -Blake

        Other possibilities for scoring that we've thought about are the length of the match (regexes that match more of an example are scored higher) and specificity (regexes that are more specific are scored higher: qr/^[A-Z]{2}$/ is more specific than qr/^\w+$/, and qr/^.+$/ is so non-specific that we don't even consider it valid).
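
The length-of-match idea could be sketched like this, in Python rather than Perl. The coverage function is a hypothetical name, and this is not the example code referred to in this thread. Specificity is harder to pin down and is not attempted here.

```python
import re

def coverage(pattern, text):
    """Fraction of text covered by the first match of pattern (0.0 if none)."""
    m = re.search(pattern, text)
    if not m or not text:
        return 0.0
    return len(m.group(0)) / len(text)

print(coverage(r"[A-Z]{2}", "AB123"))
print(coverage(r"^[A-Z]{2}\d+$", "AB123"))
```

A candidate that accounts for the whole example scores higher than one that only accounts for its first two characters.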

        Of course, this points out another weakness in the approach the example code uses: it only considers left-anchored regexes, so it tends not to notice commonalities on the right-hand side (or anywhere else in the data, for that matter).

        I'm not saying we've got the problem solved, or that it's even tractable in the general case. We just have an approach that works for some cases.

        Expanding on the idea of multiple data sets with something I forgot earlier:

        Traditionally, when you're teaching a program to do something, you use two data sets: a training set, which is properly marked ("this should match", "this shouldn't", etc), and a test set, which is also marked. You don't want to train the program on all the data at once, because you run the risk of overfitting (i.e. you get a program that does really well at matching the training data set, but is so specific to the training data that it fails on real-world data).
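
As a rough illustration of holding data back, here is a sketch in Python. The split and accuracy helpers, the 25% test fraction, and the sample data are all assumptions made for the example.

```python
import random
import re

def split(labelled, test_fraction=0.25, seed=0):
    """Shuffle (text, expected) pairs and hold some back as a test set."""
    rng = random.Random(seed)
    items = list(labelled)
    rng.shuffle(items)
    n_test = max(1, int(len(items) * test_fraction))
    return items[n_test:], items[:n_test]  # (training set, test set)

def accuracy(pattern, labelled):
    """Fraction of pairs where matching agrees with the expected label."""
    rx = re.compile(pattern)
    correct = sum(1 for text, expected in labelled
                  if bool(rx.search(text)) == expected)
    return correct / len(labelled)

data = [("123", True), ("42", True), ("7", True), ("2001", True),
        ("abc", False), ("12a", False), ("", False), ("x9", False)]
train_set, test_set = split(data)
# Tune candidates against train_set only, then report accuracy on test_set.
print(accuracy(r"^\d+$", train_set), accuracy(r"^\d+$", test_set))
```

A candidate that does much better on the training set than on the test set is a sign of overfitting.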

        --
        :wq
      I think the regex engine provides enough hooks that you could write such a function w/o access to the underlying C code. For instance, some automatically placed (?{}) assertions might allow the scoring routine to offer "partial credit" for regexes that only match part of the string. Therefore, you could allow for a finer granularity than just 1=match 0=nomatch.

      (?{}) is a relatively new feature that allows arbitrary code to be executed inside your regex. It is (ab)used in the rebug regex debugger that japhy mentioned a while back.

      -Blake