This is a copy of an email that I sent to phl.pm. Blake Mills prompted me to post it here as well, because he felt that perlmonks would be interested in it.

------

I've been doing a bit of reading on machine learning. One of the things I've been toying with is the ability to generate a regex to match a given example set of data. My particular examples would be things like phone numbers, zip codes, or other information that consists of single data elements.

I've looked on CPAN for any possible existing work, but haven't been able to find anything. Does anyone know of anything along the lines of what I'm describing? The Regexp package provides some common examples, but what I really want is a tool I can use to generate regexes for data in a generic, automated fashion.

I've tried writing some simplistic code, and it has some success with data that has a consistent format - though it creates some horrible looking regexes for less consistent data, and fails completely for inconsistent data. I'm almost embarrassed to offer this up, but if you're interested, the code I wrote to try this out is available here.
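To make the idea concrete, here is a minimal sketch of one naive approach. It is not the code linked above, and it is written in Python rather than Perl purely for illustration. It assumes every example shares the same sequence of character classes; the names `char_class`, `generalize`, and `infer_regex` are hypothetical. Each character is mapped to a class token, runs of the same token are collapsed, and the run lengths are merged across examples into quantifiers.

```python
import re
import string
from itertools import groupby


def char_class(ch):
    """Map one character to a regex token (ASCII digit, ASCII letter, or literal)."""
    if ch in string.digits:
        return r"\d"
    if ch in string.ascii_letters:
        return "[A-Za-z]"
    return re.escape(ch)  # punctuation, spaces, etc. are kept as literals


def generalize(example):
    """Collapse an example into (token, run_length) pairs, e.g. '19104' -> [('\\d', 5)]."""
    tokens = [char_class(ch) for ch in example]
    return [(tok, len(list(run))) for tok, run in groupby(tokens)]


def infer_regex(examples):
    """Build one anchored regex covering all examples, or raise ValueError."""
    if not examples:
        raise ValueError("need at least one example")
    shapes = [generalize(e) for e in examples]
    first_tokens = [tok for tok, _ in shapes[0]]
    # Assumption: all examples must have the same token sequence.
    for shape in shapes:
        if [tok for tok, _ in shape] != first_tokens:
            raise ValueError("examples do not share a common shape")
    parts = []
    for i, tok in enumerate(first_tokens):
        lengths = [shape[i][1] for shape in shapes]
        lo, hi = min(lengths), max(lengths)
        if lo == hi == 1:
            quant = ""
        elif lo == hi:
            quant = "{%d}" % lo
        else:
            quant = "{%d,%d}" % (lo, hi)
        parts.append(tok + quant)
    return "^" + "".join(parts) + "$"


if __name__ == "__main__":
    print(infer_regex(["19104", "08540", "90210"]))  # ^\d{5}$
```

This handles consistently formatted data such as five-digit zip codes, but simply gives up (raises `ValueError`) when the examples differ in shape, for instance a mix of `19104` and `19104-1234`. Handling that case would need optional groups or alternation, which is where the hard part of the problem lies.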

Any advice or pointers would be great.

Thanks,
Kyle R. Burton

Edited by footpad, ~Tue Nov 20 15:25:42 2001


In reply to generating regexes? by mortis
