text string approxiamtions (concept for review)

shemp has asked for the wisdom of the Perl Monks concerning the following question:

A lot of background info here, the real question is, what do people think of the algorithm i'm about to describe, (including the hash of arrayrefs).
I'm working on a name parser, where the name information is really ownership information. This is not very clean, in the sense that "people names" and other name-related info, such as business names, and ownership information are mixed together. For simplicity, lets just say im trying to identify if a string is "people name(s)" or other info
So, i have a list of keywords that deonte non-people names, such as "Associates, Corp, Trust, Rental", etc. Now, the problem is that the name info im trying to parse is pretty ugly, in that there are often (non-standard) abbreviations, particularly in the business words. So, I've tried Lingua::Stem, with limited success, as well as some home-cooked algorithms. I want to try applying some Levenshtein like logic also, but this needs to be fast, as i process millions of names, and what is being described here is only a (logically) small part of the process.
So, i'm thinking of storing my keyword list as follows. Since each keyword will be at least 3 chars, a hash of arrayrefs seems good, where the top-level keys are the first 2 letters of each keyword, and the arrayref associated contains the remaining letters of the words. for example:

my %list = (
   "ass"   =>   ["ociation", "iates", "embly"],
   "res"   =>   ["ource", "tatement", "taurant"],
   etc
);
[download]

With this, i can substring the beginning of each token in the name string, and use %list to find possible keywords matches, then use a Levenshtein distance measure (and other algorithms) to determine the likelihood of what the information is supposed to represent.

A cool thing that hashes do not do, would be to allow a regex for a key. This way, one wouldnt have to cycle through the keys, and apply the regex to each. The Perl internal hashing algorithm could efficiently apply the regexing, so we could do something like:

my @matches = $hash_var{/^res/};
[download]

Comment on text string approxiamtions (concept for review) Select or Download Code

Replies are listed 'Best First'.
Re: text string approxiamtions (concept for review) by gjb (Vicar) on Dec 05, 2002 at 00:25 UTC
You may want to have a look at Tie::Hash::Approx and Tie::Hash::Regex. I'm not sure whether the latter is faster than a manual loop, but it might be worth a try. Hope this helps, -gjb-	[reply]
Re: text string approxiamtions (concept for review) by BrowserUk (Patriarch) on Dec 05, 2002 at 00:18 UTC
Minor nit. As your driving for efficiency you'd be better off not copying the array each time, just get and use the reference. `my $matches = $hash_var{/^res/};` [download] You'd still need to listify the array in order to pass it to `Text::Levenshtein::distance()`, but better to do it once rather than twice. You also might need to store the whole words in the list instead of just the suffixes as that algorthm needs that. Unless you intend re-implementing your own version, in which case you could gain/save by using passing the stem and the ref to the array of suffixes. Okay you lot, get your wings on the left, halos on the right. It's one size fits all, and "No!", you can't have a different color. Pick up your cloud down the end and "Yes" if you get allocated a grey one they *are* a bit damp under foot, but someone has to get them. Get used to the wings fast cos its an 8 hour day...unless the Govenor calls for a cyclone or hurricane, in which case 16 hour shifts are mandatory. Just be grateful that you arrived just as the tornado season finished. Them buggers are real work.	[reply] [d/l] [select]
Re: text string approxiamtions (concept for review) by seattlejohn (Deacon) on Dec 05, 2002 at 03:48 UTC
Be sure to look at String::Approx too. A cool thing that hashes do not do, would be to allow a regex for a key. I can't speak for earlier versions, but you can definitely use regexes as hash keys in 5.6.1: `use strict; my %tests = ( qr/^[0-9]+$/ => 'positive integer', qr/co/i => 'company/corporation', ); foreach my $term qw(12345 Company Jones) { foreach my $key (keys %tests) { print "$term is a $tests{$key}\n" if $term =~ $key; } }` [download] $perlmonks{seattlejohn} = 'John Clyman';	[reply] [d/l]
Re: Re: text string approxiamtions (concept for review) by shemp (Deacon) on Dec 05, 2002 at 16:46 UTC
I poorly stated my intentions for regexs with hashes. What i want is to be able to use a regex as a lookup of values in the hash. For instance: `my %names = ( "sean" => 29, "lisa" => 14, "bob" => 25, "matt" => 23, "tom" => 52 ); my @matches = %names{/a/};` [download] Here, the regex /a/ would match the keys "sean", "lisa", and "matt". so matches would contain the cooresponding values 29, 14, 23 It would also be cool to be able to get the matched keys, so something like `my @keys = keys %names{/a/};` [download] Then @keys would contain the array of matched keys: "sean", "lisa", "matt" I'm making up the syntax, but i think this better describes what i would like to be able to do.	[reply] [d/l] [select]
Re: Re: Re: text string approxiamtions (concept for review) by seattlejohn (Deacon) on Dec 05, 2002 at 19:14 UTC
`grep` and `map` can probably help. First, to get a subset of hash keys: `my @a_keys = grep {/a/} (keys %names);` Then to get their values: `my @a_values = map {$names{$_}} grep {/a/} (keys %names);` You could also get the job done using hash slices, like so: `my @a_values = @names{grep {/a/} keys %names};` That's more like the syntax you proposed, though I find it a bit disconcerting (maybe because I don't find the `@names{...}` hash-slice notation particularly natural in the first place). $perlmonks{seattlejohn} = 'John Clyman'; Update: only answered half the question in my original reply. Fixed here.	[reply] [d/l] [select]
Re: Re: Re: Re: text string approxiamtions (concept for review) by shemp (Deacon) on Dec 05, 2002 at 22:58 UTC
Re: text string approxiamtions (concept for review) by hawtin (Prior) on Dec 05, 2002 at 23:48 UTC
Have you looked at Soundex? It is the way that genealogists "approximately match" surnames (a few hundred years ago surname spellings were more variable than they are today). There is even a Text::Soundex CPAN module and a thread about it here Thought it might be of interest	[reply]


laziness, impatience, and hubris
	PerlMonks