Based on the responses so far, I thought it would be useful to give a brief overview of the approach the example code uses.

It starts out by checking the training set to see if all of the exmples share a common word (determined by \b). If they do, it divides the training set into 2 sub-sets by splitting each of the examples based on the common word. It then attacks each sub-set independently.

The code then takes the first character from the first example, and produces a set of candidate regexes that match this character. The candidates are purposefuly orderd from most specific to least specific.

From the candidate set, it picks the most suitable candidate. Most suitable in this case means the first of the candidates that matches each of the examples from the training set. The selected regex is then appended to the current regex.

The current regex is then tried against each of the examples. If it matches each of the examples exactly (/^pat$/), the algorithm terminates.

If the current regex doesn't match, the remainder ($1 from: /^pat(.+)$/) is extracted, and the process continues refining the regex.

There is an artifical limit of 25 iterations in the code to stop it in case it's not reaching a viable regex.

There are several issues with the code. One is that the next candidate might match more of the current example, but match less of other examples (this usualy causes an infinite loop). Another issue is that the candidate pattern might match the current example, but no others, thus producing an empty candidate set for matching the other examples. This could be solved by backtracking (removing the last regex atom off of the current candidate), but would require a major reworking of the example code (and still may never terminate).

That's the basic high-level overview of how the sample code works. As suggested, a genetic programming approach would probably produce better results. For the problem domain I've been attacking, the example code has actualy worked rather well.

Kyle


In reply to Re: generating regexes? by mortis
in thread generating regexes? by mortis

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.