
As another thought for you, consider the Levenshtein distance, and also the Perl module that computes it, Levenshtein.pm.

I think you are just looking for what are called N1 errors, meaning one substitution, one deletion, or one addition. Life gets complicated if you are working with a subset of N2 that allows transpositions. For example, RATE and ARTE would then match, but under Levenshtein that counts as 2 errors. Note that 2 errors could also be RATE vs XATB, which is a completely different word. So Levenshtein == 2 can produce a lot of false "positives".
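
To make the counting concrete, here is a minimal pure-Perl sketch of the textbook dynamic-programming Levenshtein distance. The name lev_distance is my own for illustration; it is not the API of Levenshtein.pm, and a real module will be better tested and probably faster.

```perl
use strict;
use warnings;

# Textbook Levenshtein distance (hypothetical helper, not a module API).
# A substitution, deletion, or insertion each costs 1. A transposition of
# adjacent letters is not a primitive operation here, so it costs 2.
sub lev_distance {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my $m = scalar @s;
    my $n = scalar @t;

    # @prev holds the distances from the previous prefix of $s
    # to every prefix of $t.
    my @prev = (0 .. $n);
    for my $i (1 .. $m) {
        my @cur = ($i);
        for my $j (1 .. $n) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            my $best = $prev[$j - 1] + $cost;                      # match or substitute
            $best = $prev[$j] + 1    if $prev[$j] + 1    < $best;  # delete from $s
            $best = $cur[$j - 1] + 1 if $cur[$j - 1] + 1 < $best;  # insert into $s
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[$n];
}

for my $pair (['RATE', 'RATS'], ['RATE', 'ARTE'], ['RATE', 'XATB']) {
    my ($x, $y) = @$pair;
    printf "%s vs %s: %d\n", $x, $y, lev_distance($x, $y);
}
# RATE vs RATS: 1
# RATE vs ARTE: 2
# RATE vs XATB: 2
```

The transposed pair and the unrelated pair get the same score of 2, which is why a plain "distance <= 2" filter lets in so much noise.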

The last pattern-matching code that I wrote analyzed the input string, then dynamically generated a regex, which was compiled and run against the candidate strings. There are a lot of "yeah, but"s with code like that, and it was very specific to my particular application. This is just a comment that something like that is possible in Perl: have Perl write a program (a regex), and then run it.
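
As a rough illustration of the idea (not my original code, which was application specific), the sketch below has Perl build a regex that accepts a word with at most one substitution, deletion, or insertion. The helper name one_error_regex and the sample word list are assumptions made up for the example.

```perl
use strict;
use warnings;

# Build a compiled regex matching $word with at most one error
# (hypothetical helper for illustration only).
sub one_error_regex {
    my ($word) = @_;
    my $len  = length($word);
    my @alts = (quotemeta($word));               # exact match

    # ".?" in place of character $i covers a substitution (one char)
    # or a deletion (zero chars) at that position.
    for my $i (0 .. $len - 1) {
        my $pre  = quotemeta(substr($word, 0, $i));
        my $post = quotemeta(substr($word, $i + 1));
        push @alts, $pre . '.?' . $post;
    }

    # "." squeezed in at position $i covers an insertion there,
    # including before the first and after the last character.
    for my $i (0 .. $len) {
        my $pre  = quotemeta(substr($word, 0, $i));
        my $post = quotemeta(substr($word, $i));
        push @alts, $pre . '.' . $post;
    }

    my $re = join '|', @alts;
    return qr/\A(?:$re)\z/;                      # anchored to the whole string
}

my $re   = one_error_regex('RATE');
my @hits = grep { $_ =~ $re } qw(RATE RATS RAT RATES XRATE ARTE XATB);
print "@hits\n";   # RATE RATS RAT RATES XRATE
```

The regex is compiled once with qr// and then reused for every candidate. Note that ARTE does not match, because a transposition is two errors in this scheme.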

I think that the standard Levenshtein module will do most of what you want, and I would start there. There are links to other "string compares" in the docs for Levenshtein.pm. If somebody else has already built the wheel that you need, I would use it. Some of these modules have XS versions, which will run much more quickly than pure Perl.

Update: I thought that I should mention yet another approximate matcher, agrep (see the agrep wiki). The algorithms are top notch and agrep is fast. In a lot of cases it will outperform standard grep for simple matches.

The caveat here is that I am unsure of the code status. I was using agrep for all of my grepping until I caught it missing a match! I spent several days reducing my dataset to a "smallest" reproducible error report and talked with the maintainer. He verified the problem, but indicated that a fix would be difficult. That was about a decade ago, and I'm not sure what happened (i.e. whether the "Marshall" fix got implemented or not). If it didn't, then very rarely agrep will miss a match that it should get, even when used like standard grep. With that caveat, agrep is pretty cool. Certainly the algorithms are.

In reply to Re: Comparing Lines within a Word List by Marshall
in thread Comparing Lines within a Word List by dominick_t


