This was actually the direction the mailing-list discussion took. The final suggestion was that you'd need two sets of data: one set that should match, and one set that shouldn't. The scoring function would combine correctly matching those that should match with correctly *not* matching those that shouldn't.
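As a rough sketch of that idea (not the code from the mailing list; score_regex and the sample data are made up for illustration), the score could simply be the fraction of examples the candidate regex handles correctly:

```perl
use strict;
use warnings;

# Hypothetical helper: fraction of examples the candidate gets right,
# i.e. it matches a "should match" string or rejects a "shouldn't" string.
sub score_regex {
    my ($re, $should_match, $should_not_match) = @_;
    my $total = @$should_match + @$should_not_match;
    return 0 unless $total;

    # grep in scalar context returns the number of hits
    my $hits    = grep { $_ =~ $re } @$should_match;
    my $rejects = grep { $_ !~ $re } @$should_not_match;
    return ($hits + $rejects) / $total;
}

# Made-up sample data
my @good = qw(NY CA TX);
my @bad  = qw(ny Cal 12 T);

my $strict_score = score_regex(qr/^[A-Z]{2}$/, \@good, \@bad);  # 7 of 7 correct
my $loose_score  = score_regex(qr/^\w+$/,      \@good, \@bad);  # 3 of 7 correct
```

Weighting false matches more heavily than misses (or the other way round) would be an easy variation, depending on which kind of mistake hurts more.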
-Blake
Other possibilities for scoring that we've thought about are the length
of the match (regexes that match more of an example are scored higher)
and specificity (regexes that are more specific are scored higher).
For example, qr/^[A-Z]{2}$/ is more specific than qr/^\w+$/, and
qr/^.+$/ is so non-specific that we don't even consider it valid.
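The length-of-match idea is easy to sketch with the @- and @+ arrays, which hold the start and end offsets of the last successful match. The coverage() name is made up here, and this only covers match length; it makes no attempt at measuring specificity:

```perl
use strict;
use warnings;

# Hypothetical helper: what fraction of the example string did the
# regex actually match? Returns 0 when there is no match at all.
sub coverage {
    my ($re, $string) = @_;
    return 0 unless length($string) && $string =~ $re;

    # $-[0] and $+[0] are the start and end offsets of the whole match
    return ($+[0] - $-[0]) / length($string);
}

my $partial = coverage(qr/[A-Z]{2}/,  'NY-10001');   # matches only "NY"
my $full    = coverage(qr/^[\w-]+$/,  'NY-10001');   # matches the whole string
```

Note that a regex anchored at both ends always covers either all of the string or none of it, so this measure only separates candidates that are allowed to match a substring.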
Of course, this points out another weakness in the approach the
example code uses - it only considers left-anchored regexes, so it
tends not to notice commonalities on the right hand side (or anywhere
else in the data for that matter).
I'm not saying we've got the problem solved, or that it's even tractable
in the general case. We just have an approach that works for some cases.
Expanding on the idea of multiple data sets with
something I forgot earlier:
Traditionally, when you're teaching a program to do
something, you use two data sets: a training set, which
is properly marked ("this should match", "this shouldn't",
etc), and a test set, which is also marked but is held back
to check what the program learned. You don't want to train
the program on all the data at once, because you run the risk
of overfitting (i.e. you get a program that does really well
at matching the training data set, but is so specific to the
training data that it fails on real-world data).
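A minimal sketch of holding back a test set, assuming each example is stored as a [string, should_match] pair (that layout and the split_examples name are just assumptions for this example):

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper: shuffle the marked examples and hold back a
# fraction of them as the test set. Returns (\@train, \@test).
sub split_examples {
    my ($examples, $test_fraction) = @_;
    my @shuffled = shuffle @$examples;
    my $n_test   = int(@shuffled * $test_fraction);

    # splice removes the first $n_test items; what remains is the training set
    my @test = splice(@shuffled, 0, $n_test);
    return (\@shuffled, \@test);
}

# Made-up marked data: [ string, 1 if it should match, 0 if not ]
my @examples = map { [ $_, /^[A-Z]{2}$/ ? 1 : 0 ] }
               qw(NY CA TX WA OR ny Cal 12 T ab);

my ($train, $test) = split_examples(\@examples, 0.2);
```

The candidate regexes would be built and scored against $train only, and the winner checked once against $test; a big drop between the two scores is the overfitting warning sign.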
--
:wq
I think the regex engine provides enough hooks that you could write such a function without access to the underlying C code. For instance, some automatically placed (?{}) assertions might allow the scoring routine to offer "partial credit" for regexes that only match part of the string. That would give you a finer granularity than just 1=match, 0=nomatch.
(?{}) is a relatively new feature that allows arbitrary code to be executed inside your regex. It is (ab)used in the
rebug regex debugger that japhy mentioned a while back.
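Here is a rough sketch of what "automatically placed" probes could look like. It assumes the candidate regex is available as a list of fragments, and partial_score, $furthest and the scoring formula are all invented for this example:

```perl
use strict;
use warnings;

our $furthest;   # package variable, updated from inside the (?{ }) probes

# Hypothetical helper: @pieces are regex fragments that, joined together,
# form the candidate pattern. A probe after each piece records the
# furthest position the engine reached, even if the overall match fails.
sub partial_score {
    my ($string, @pieces) = @_;
    use re 'eval';   # needed to interpolate (?{ }) into a pattern at run time

    my $probe   = '(?{ $main::furthest = pos() if pos() > $main::furthest })';
    my $pattern = '^' . join('', map { $_ . $probe } @pieces) . '$';

    local $furthest = 0;
    return 1 if $string =~ /$pattern/;

    # No full match: give credit for how far we got, always below 1
    return $furthest / (length($string) + 1);
}

my $full    = partial_score('AB12', '[A-Z]{2}', '\d+');   # full match
my $partial = partial_score('AB1x', '[A-Z]{2}', '\d+');   # fails at the "x"
my $none    = partial_score('ab',   '[A-Z]{2}', '\d+');   # fails immediately
```

Inside a (?{}) block, pos() reports the current match position, which is what makes this work. The regex optimizer can reject a string without ever running the probes, so treat the partial scores as approximate, and only enable re 'eval' for patterns you built yourself.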
-Blake