PerlMonks  

Re: Re: (FoxUni) Re: generating regexes?

by blakem (Monsignor)
on Nov 20, 2001 at 02:25 UTC ( [id://126407]=note )


in reply to Re: (FoxUni) Re: generating regexes?
in thread generating regexes?

This was actually the direction the mailing-list discussion took. The final suggestion was that you'd need two sets of data: one that should match and one that shouldn't. The scoring function would combine two measures: correctly matching the strings that should match, and correctly *not* matching those that shouldn't.
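
To make the two-set idea concrete, here is a minimal sketch of such a scoring function. The name score_regex, the equal weighting of the two sets, and the sample strings are assumptions for illustration, not something taken from the mailing-list discussion.

```perl
use strict;
use warnings;

# Hypothetical scoring function: the fraction of examples that the
# candidate regex classifies correctly.
#   $re      - candidate regex (a qr// object)
#   $match   - array ref of strings that should match
#   $nomatch - array ref of strings that should not match
sub score_regex {
    my ($re, $match, $nomatch) = @_;
    my $total = @$match + @$nomatch;
    return 0 unless $total;

    my $hits     = grep { $_ =~ $re } @$match;      # correctly matched
    my $rejected = grep { $_ !~ $re } @$nomatch;    # correctly not matched
    return ($hits + $rejected) / $total;
}

# Made-up sample data, purely illustrative.
my @should   = ('CA', 'NY', 'TX');
my @shouldnt = ('ca', 'N1', 'Texas');

for my $candidate (qr/^[A-Z]{2}$/, qr/^\w+$/) {
    printf "%-20s %.2f\n", $candidate, score_regex($candidate, \@should, \@shouldnt);
}
```

With this sample data the first candidate scores 1.00 and the second 0.50, because qr/^\w+$/ also matches every string that it was supposed to reject.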

-Blake

Replies are listed 'Best First'.
Re: Re: Re: (FoxUni) Re: generating regexes?
by mortis (Pilgrim) on Nov 20, 2001 at 03:11 UTC
    Other possibilities for scoring that we've thought about are the length of the match (regexes that match more of an example score higher) and specificity (more specific regexes score higher). For example, qr/^[A-Z]{2}$/ is more specific than qr/^\w+$/, and qr/^.+$/ is so non-specific that we don't even consider it valid.
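
    As a rough illustration of those two ideas, the sketch below measures how much of each example a match covers and flags candidates that are too general. The helper names coverage and too_general, the probe strings, and the "matches every probe" test for non-specificity are all assumptions; the example code discussed in this thread may do it differently.

```perl
use strict;
use warnings;

# Hypothetical helper: average fraction of each example's characters
# covered by the match (0 for examples that do not match at all).
sub coverage {
    my ($re, @examples) = @_;
    return 0 unless @examples;
    my $sum = 0;
    for my $s (@examples) {
        next unless length($s) && $s =~ $re;
        # $-[0] and $+[0] are the start and end offsets of the last match.
        $sum += ($+[0] - $-[0]) / length($s);
    }
    return $sum / @examples;
}

# Hypothetical validity check: a candidate that matches every one of a
# set of unrelated probe strings is treated as too general to be useful.
sub too_general {
    my ($re, @probes) = @_;
    return 0 unless @probes;
    my $matched = grep { $_ =~ $re } @probes;
    return $matched == @probes ? 1 : 0;
}

my @probes = ('CA', 'hello world', '12345', '!?');   # made-up probes

printf "partial  %.2f\n", coverage(qr/[A-Z]{2}/, 'CA-90210');
printf "full     %.2f\n", coverage(qr/^[A-Z]{2}-\d{5}$/, 'CA-90210');
print  "qr/^.+\$/ rejected as too general\n" if too_general(qr/^.+$/, @probes);
```

    A crude probe test like this only catches the extreme case (qr/^.+$/); ranking qr/^[A-Z]{2}$/ above qr/^\w+$/ would need a finer measure of specificity.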

    Of course, this points out another weakness in the approach the example code uses: it only considers left-anchored regexes, so it tends not to notice commonalities on the right-hand side (or anywhere else in the data, for that matter).

    I'm not saying we've got the problem solved, or that it's even tractable in the general case. We just have an approach that works for some cases.

Re(4): generating regexes?
by FoxtrotUniform (Prior) on Nov 20, 2001 at 02:40 UTC

    Expanding on the idea of multiple data sets with something I forgot earlier:

    Traditionally, when you're teaching a program to do something, you use two data sets: a training set, which is properly marked ("this should match", "this shouldn't", etc), and a test set, which is also marked but is held back to check the result. You don't want to train the program on all the data at once, because you run the risk of overfitting (i.e. you get a program that does really well at matching the training data set, but is so specific to the training data that it fails on real-world data).
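
    A minimal sketch of that split, assuming each example is stored as a [ string, should_match ] pair. The helper names split_examples and accuracy and the sample data are made up for illustration. The "overfit" candidate is simply an alternation of the positive training strings, which is about as specific to the training data as a regex can get.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper: shuffle the marked examples and split them into
# a training set and a held-back test set.
sub split_examples {
    my ($fraction, @examples) = @_;
    my @shuffled = shuffle @examples;
    my $cut = int($fraction * @shuffled);
    return ( [ @shuffled[0 .. $cut - 1] ], [ @shuffled[$cut .. $#shuffled] ] );
}

# Fraction of marked examples that the regex classifies correctly.
sub accuracy {
    my ($re, $examples) = @_;
    return 0 unless @$examples;
    my $correct = grep { ($_->[0] =~ $re ? 1 : 0) == ($_->[1] ? 1 : 0) } @$examples;
    return $correct / @$examples;
}

# Made-up marked data: [ string, should_match ].
my @marked = (
    [ 'CA', 1 ], [ 'NY', 1 ], [ 'TX', 1 ],    [ 'WA',  1 ],
    [ 'ca', 0 ], [ 'N1', 0 ], [ 'Texas', 0 ], [ 'W A', 0 ],
);
my ($train, $test) = split_examples(0.75, @marked);

# Overfitted candidate: matches exactly the positive training strings.
my $alternation = join '|', map { quotemeta($_->[0]) } grep { $_->[1] } @$train;
my $overfit = qr/^(?:$alternation)$/;
my $general = qr/^[A-Z]{2}$/;

for my $c ([ overfit => $overfit ], [ general => $general ]) {
    printf "%-8s train %.2f  test %.2f\n",
        $c->[0], accuracy($c->[1], $train), accuracy($c->[1], $test);
}
```

    The overfitted candidate is always perfect on the training set, but it misses any positive example that happened to land in the test set, which is exactly the failure the held-back data is there to expose.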

    --
    :wq
