This sounds like a good problem domain for genetic algorithms, which I unfortunately don't know much about. (The machine learning course I took at uni was supposed to get into GAs, but of course we ran out of time....)

Here's the basic theory (for more info, look at geneticprogramming.com):

1. Generate a population of possible solutions, pretty much at random.
2. Run some sort of fitness test on the solutions (in this case, try matching them against your data, and see how many matches you get and how close those matches are).
3. Generate a new population: copy the best solutions over verbatim, mutate some of the solutions, and "breed" (cross over) some of the solutions.
4. Repeat until you get a "close enough" solution.
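
To make that loop concrete, here's a rough sketch of it applied to regex generation. It's in Python rather than Perl purely to keep it short, and everything in it is an assumption of mine rather than an established method: the TOKENS building blocks, the fitness function (positives fully matched plus negatives rejected), and the population and mutation settings are all made up for illustration.

```python
import random
import re

# Hypothetical building blocks. Concatenating any of them gives a valid regex.
TOKENS = [r"\d", r"\d+", r"[a-z]", r"[a-z]+", r"[A-Z]", "-", "_"]
MAX_LEN = 8  # cap on genome length so patterns stay small


def random_genome(rng):
    """A genome is just a short list of tokens."""
    return [rng.choice(TOKENS) for _ in range(rng.randint(1, MAX_LEN))]


def fitness(genome, positives, negatives):
    """Count positives fully matched plus negatives rejected."""
    pattern = re.compile("".join(genome))
    hits = sum(1 for s in positives if pattern.fullmatch(s))
    rejects = sum(1 for s in negatives if not pattern.fullmatch(s))
    return hits + rejects


def mutate(genome, rng):
    """Replace one randomly chosen token."""
    child = list(genome)
    child[rng.randrange(len(child))] = rng.choice(TOKENS)
    return child


def crossover(a, b, rng):
    """Splice the head of one parent onto the tail of another."""
    child = a[: rng.randint(0, len(a))] + b[rng.randint(0, len(b)):]
    return child[:MAX_LEN] or [rng.choice(TOKENS)]


def evolve(positives, negatives, pop_size=50, generations=200, seed=0):
    rng = random.Random(seed)
    target = len(positives) + len(negatives)

    def score(genome):
        return fitness(genome, positives, negatives)

    population = [random_genome(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        if score(population[0]) == target:
            break  # "close enough": every example is classified correctly
        elite = population[: pop_size // 5]
        children = list(elite)  # copy the best over verbatim
        while len(children) < pop_size:
            if rng.random() < 0.5:
                children.append(mutate(rng.choice(elite), rng))
            else:
                children.append(crossover(rng.choice(elite), rng.choice(elite), rng))
        population = children

    best = max(population, key=score)
    return "".join(best), score(best)


if __name__ == "__main__":
    good = ["abc-123", "xy-7", "hello-42"]
    bad = ["abc", "123-abc", "abc-", "ABC-123"]
    print(evolve(good, bad))
```

A real version would want a richer token set (alternation, grouping, bounded quantifiers) and a fitness function that rewards partial matches, which is the "how close those matches are" part of step 2.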

gumpu has done some genetic programming in Perl before.

It strikes me that, since Perl's regexes are built around a backtracking finite automaton, it might be possible to compute a regex analytically from "representative" data, using some sort of search or constraint-satisfaction technique.

It also seems plausible that you could use Markov chains to describe the data. I've seen this technique work fairly well at finding potential coding regions in DNA, which looks like a similar problem (looking for patterns in connected data). Since Markov processes are pretty close to state machines, they might mesh well with the regex engine....
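
As a rough illustration of the Markov chain idea (again in Python, and again only a sketch under my own assumptions), the code below trains a first-order chain over character classes and scores new strings by log-likelihood. The char_class mapping, the "^" and "$" boundary states, and the floor probability for unseen transitions are all hypothetical choices, not anything taken from the DNA work mentioned above.

```python
import math
from collections import defaultdict


def char_class(ch):
    """Collapse characters into coarse classes (an assumed, simplistic mapping)."""
    if ch.isdigit():
        return "d"
    if ch.isalpha():
        return "a"
    if ch.isspace():
        return "s"
    return ch  # punctuation is kept as a literal state


def states_of(s):
    """Turn a string into a state sequence with start and end markers."""
    return ["^"] + [char_class(c) for c in s] + ["$"]


def train(samples):
    """Estimate transition probabilities between consecutive states."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in samples:
        seq = states_of(s)
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    return {
        prev: {cur: n / sum(nxt.values()) for cur, n in nxt.items()}
        for prev, nxt in counts.items()
    }


def log_likelihood(model, s, floor=1e-6):
    """Sum of log transition probabilities; unseen transitions get a small floor."""
    seq = states_of(s)
    return sum(
        math.log(model.get(prev, {}).get(cur, floor))
        for prev, cur in zip(seq, seq[1:])
    )


if __name__ == "__main__":
    model = train(["2001-03-14", "1999-12-31"])
    for candidate in ["2024-01-01", "hello world"]:
        print(candidate, log_likelihood(model, candidate))
```

Strings shaped like the training data should score noticeably higher than unrelated ones. The transition table is also a crude state machine in its own right, so the high-probability paths through it could serve as a starting point for a regex.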

Great problem! Thanks for bringing this up. If I have time today, I'll hunt down some useful-looking papers and update this node.

-- :wq

In reply to Re: generating regexes? by FoxtrotUniform
in thread generating regexes? by mortis

	PerlMonks