text string approxiamtions (concept for review)by shemp (Deacon)
|on Dec 04, 2002 at 23:48 UTC||Need Help??|
shemp has asked for the wisdom of the Perl Monks concerning the following question:
A lot of background info here, the real question is, what do people think of the algorithm i'm about to describe, (including the hash of arrayrefs).
I'm working on a name parser, where the name information is really ownership information. This is not very clean, in the sense that "people names" and other name-related info, such as business names, and ownership information are mixed together. For simplicity, lets just say im trying to identify if a string is "people name(s)" or other info
So, i have a list of keywords that deonte non-people names, such as "Associates, Corp, Trust, Rental", etc. Now, the problem is that the name info im trying to parse is pretty ugly, in that there are often (non-standard) abbreviations, particularly in the business words. So, I've tried Lingua::Stem, with limited success, as well as some home-cooked algorithms. I want to try applying some Levenshtein like logic also, but this needs to be fast, as i process millions of names, and what is being described here is only a (logically) small part of the process.
So, i'm thinking of storing my keyword list as follows. Since each keyword will be at least 3 chars, a hash of arrayrefs seems good, where the top-level keys are the first 2 letters of each keyword, and the arrayref associated contains the remaining letters of the words. for example:
With this, i can substring the beginning of each token in the name string, and use %list to find possible keywords matches, then use a Levenshtein distance measure (and other algorithms) to determine the likelihood of what the information is supposed to represent.
A cool thing that hashes do not do, would be to allow a regex for a key. This way, one wouldnt have to cycle through the keys, and apply the regex to each. The Perl internal hashing algorithm could efficiently apply the regexing, so we could do something like: