Other possibilities for scoring that we've thought about are: the length
of the match (regexes that match more of an example score higher) and
specificity (regexes that are more specific score higher, so
qr/^[A-Z]{2}$/ is more specific than qr/^\w+$/, while qr/^.+$/
is so non-specific that we don't even consider it valid).
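To make the specificity idea concrete, here's a minimal sketch of one possible heuristic, in Python rather than the thread's Perl. The weighting scheme (literals most specific, character classes less so, "." barely at all) is entirely made up for illustration; it is not the scoring the example code actually uses.

```python
# Hypothetical specificity heuristic (not the code from the thread):
# score a pattern by how constrained its pieces are. Literal characters
# are most specific, classes like \w or [A-Z] less so, and "." almost
# not at all. Weights are arbitrary; only the ordering matters here.
def specificity(pattern: str) -> float:
    body = pattern.strip("^$")   # anchors don't constrain content width
    score = 0.0
    i = 0
    while i < len(body):
        c = body[i]
        if c == "\\" and i + 1 < len(body):
            score += 0.5                      # shorthand class, e.g. \w
            i += 2
        elif c == "[":
            i = body.index("]", i) + 1        # explicit class like [A-Z]
            score += 0.75
        elif c == "{":
            i = body.index("}", i) + 1        # skip quantifier bounds
        elif c in "+*?":
            i += 1                            # quantifiers add nothing
        elif c == ".":
            score += 0.1                      # "." matches almost anything
            i += 1
        else:
            score += 1.0                      # literal: most specific
            i += 1
    return score

# qr/^[A-Z]{2}$/ outranks qr/^\w+$/, and qr/^.+$/ scores so low that
# a threshold could rule it out entirely, as the post suggests.
ranked = sorted([r"^[A-Z]{2}$", r"^\w+$", r"^.+$"],
                key=specificity, reverse=True)
```

A real scorer would have to parse the regex properly (alternation, groups, escapes inside classes), but even this crude character walk reproduces the ordering the post describes.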
Of course, this points out another weakness in the approach the
example code uses: it only considers left-anchored regexes, so it
tends not to notice commonalities on the right-hand side (or anywhere
else in the data, for that matter).
I'm not saying we've got the problem solved, or that it's even tractable
in the general case. We just have an approach that works for some cases.
Expanding on the idea of multiple data sets with
something I forgot earlier:
Traditionally, when you're teaching a program to do
something, you use two data sets: a training set, which
is properly marked ("this should match", "this shouldn't",
etc.), and a test set, which is also marked. You
don't want to train the program on all the data at
once, because you run the risk of overfitting (i.e. you
get a program that does really well on the training
set, but is so specific to the training data that it
fails on real-world data).
--
:wq
| [reply] [Watch: Dir/Any] |