This sounds like a good problem domain for genetic
algorithms, which I unfortunately don't know much about.
(The machine learning course I took at uni was supposed
to get into GAs, but of course we ran out of time....)
Here's the basic theory (for more info, look at
geneticprogramming.com):
- Generate a population of possible solutions pretty
much at random.
- Run some sort of fitness test on the solutions (in
this case, try matching them against your data, and see
how many matches you get, and how close those matches
are).
- Generate a new population: copy the best solutions
over verbatim, mutate some of the solutions, and "breed"
(cross-over) some of the solutions.
- Repeat until you get a "close enough" solution.
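To make the loop above concrete, here is a minimal sketch in Python (chosen only for illustration; the same shape would work in Perl). Everything in it is an assumption on my part: the token list, the population size, the 50/50 mutate/breed split, and a fitness test that just counts correct matches and correct rejections rather than measuring how close a match is.

```python
import random
import re

# Hypothetical building blocks for candidate regexes; pick tokens that suit your data.
# The bare "." is deliberate: it acts as a single-character wildcard.
TOKENS = [r"\d", r"\d+", r"[a-z]", r"[a-z]+", r"[A-Z]", "-", "."]
MAX_LEN = 8  # assumed cap on tokens per candidate, to keep patterns short


def random_candidate(rng):
    """Build a random candidate: a list of 1..MAX_LEN tokens."""
    return [rng.choice(TOKENS) for _ in range(rng.randint(1, MAX_LEN))]


def fitness(candidate, positives, negatives):
    """Count positives fully matched plus negatives correctly rejected."""
    pattern = re.compile("".join(candidate))
    hits = sum(1 for s in positives if pattern.fullmatch(s))
    rejects = sum(1 for s in negatives if not pattern.fullmatch(s))
    return hits + rejects


def mutate(rng, candidate):
    """Replace one randomly chosen token with a random token."""
    child = list(candidate)
    child[rng.randrange(len(child))] = rng.choice(TOKENS)
    return child


def crossover(rng, a, b):
    """Join a prefix of one parent to a suffix of the other."""
    child = a[: rng.randint(0, len(a))] + b[rng.randint(0, len(b)):]
    return child[:MAX_LEN] or list(a)  # never return an empty candidate


def evolve(positives, negatives, pop_size=60, generations=200, seed=0):
    """Return (best pattern found, its fitness score)."""
    rng = random.Random(seed)
    score = lambda c: fitness(c, positives, negatives)
    best_possible = len(positives) + len(negatives)
    population = [random_candidate(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        if score(population[0]) == best_possible:
            break  # "close enough": every example is classified correctly
        elite = population[: pop_size // 5]
        children = [list(c) for c in elite]  # copy the best over verbatim
        while len(children) < pop_size:
            if rng.random() < 0.5:
                children.append(mutate(rng, rng.choice(elite)))
            else:
                children.append(crossover(rng, rng.choice(elite), rng.choice(elite)))
        population = children
    best = max(population, key=score)
    return "".join(best), score(best)
```

Calling `evolve(["ab-12", "cd-34"], ["abc", "1234"])` returns the best pattern found together with its score. There is no guarantee it reaches a perfect score, and a pattern that fits a handful of examples may well overfit them.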
gumpu has done some genetic programming in Perl before.
It strikes me that, since Perl's regexes are built
around a backtracking finite automaton, it might be
possible to analytically compute a regex from
"representative" data, using either some sort of search
or constraint-satisfaction techniques. It also seems
plausible that you'd be able to use Markov chains to
describe the data: I've seen this technique work fairly
well at finding potential coding regions in DNA, which
looks like a similar problem (looking for patterns in
connected data), and since Markov processes are pretty
close to state machines, they might mesh well with the
regex engine....
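As a rough illustration of the Markov chain idea, here is a small Python sketch of a first-order, character-level model: it counts which character follows which in the sample data, then scores a new string by how likely its transitions are. The boundary markers, the add-one smoothing, and the assumed alphabet size of 128 are all my own choices for the sketch, not anything taken from the DNA work mentioned above.

```python
import math
from collections import defaultdict

# Hypothetical boundary markers; multi-character so they cannot clash with a real character.
START, END = "<s>", "</s>"


def train(examples):
    """Count character-to-character transitions across the example strings."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in examples:
        chars = [START, *s, END]
        for prev, nxt in zip(chars, chars[1:]):
            counts[prev][nxt] += 1
    return counts


def log_likelihood(counts, s, alphabet_size=128):
    """Log-probability of s under the model, using add-one smoothing.

    alphabet_size is an assumed bound on the number of distinct symbols.
    """
    total = 0.0
    chars = [START, *s, END]
    for prev, nxt in zip(chars, chars[1:]):
        row = counts.get(prev, {})
        seen = sum(row.values())
        total += math.log((row.get(nxt, 0) + 1) / (seen + alphabet_size))
    return total
```

Strings that look like the training data score higher (closer to zero) than strings that don't, so a threshold on the score gives a crude "does this resemble my data?" test. Turning the high-probability transitions into an actual regex is a separate step that this sketch does not attempt.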
Great problem! Thanks for bringing this up. If I
have time today, I'll hunt down some useful-looking
papers and update this node.
--
:wq