http://qs321.pair.com?node_id=21360

Buckaroo Buddha has asked for the wisdom of the Perl Monks concerning the following question:


here's a question for the best of you ...
maybe this is an algorythm question, i'm not sure,
i've got a pattern that may change slightly from one
situation to another, let's call it title

this title can be passed from one person to another or can change slightly and still be the same thing.

i've got to run a match on that title over a period
of time and run a calculation based on a data-element
at that period of time.
I'm using a multi dimentional hash of arrays

here is the code for my that statement which should give a better understanding of the data-structure and what i'm trying to do ...

sub Compress { foreach my $KEYmonth (sort keys %{$_[0]}) { foreach my $KEYcat (sort keys %{$_[0]{$KEYmonth}}) {#cat= category foreach my $KEYsubcat (sort keys %{$_[0]{$KEYmonth}{$KEYcat}}) { $i=0; foreach my $value (@{$_[0]{$KEYmonth}{$KEYcat}{$KEYsubcat}}) { + $OUTPUT{'YTD'}{$KEYcat}{$KEYsubcat}[$i++] += $value; + } } } } }

i do this over a number of months for many different
categories, and sub-categories. Unfortunately, in one (or two)
of the situations the subcategory-keys are variable enough
to prevent a positive match.

the data could be:
 "INITIAL.LASTNAME(x)(LONG NAME TITLE)" one month and:
 "INTL.SOMEOTHERNAME(x)(LONG NAME TITLE)" or:
 "INITIAL.LASTNAME(x)(LONG NAME. TITLE)" the next.

that's a mightly long setup for the question but i'm
trying to make it both specific to my problem and generalizable
to the world (sortof like something you could put into a textbook
example if it was well answered, eh merlyn ? *jk* ;)

the question to this is then, how would one go about designing a
matching process that could calculate a confidence interval for NEAR-MATCHES
and in the event of finding this close match use it (and its key) for the
calculation.

this is intended to be a fun interesting exercise, i've tried to
write a NEAR-MATCH function before (and failed miserably).
I'm looking to learn some process or method for tackling
this sort of problem (which i can only assume others will have had experience with)
but i might not necessarily do it (i'm allowed to tell them 'it can't be done in
situation x').

Replies are listed 'Best First'.
Re: pattern matching with heuristics
by ZZamboni (Curate) on Jul 06, 2000 at 23:13 UTC
    A good starting point could be String::Approx. I have not used it, but the description says it does fuzzy (approximate) string matching.

    --ZZamboni

RE: pattern matching with heuristics
by Ovid (Cardinal) on Jul 06, 2000 at 23:28 UTC
    ZZamboni is right about using String::Approx to get this done. But be forewarned: String::Approx runs about as fast as a one-legged dog. I don't remember how much slower (the info is in the Perl Cookbook which I can't find since I moved, sniff, sniff), but I seem to recall it's something like 10 times slower than a typical regex. Thus, you don't want to use if if you can avoid it, and if you can't, try to avoid iterating over it.

    Cheers,
    Ovid

    Update: kudra has her copy of the Cookbook (I'm so jealous) and says that it is 10-40 times slower. If she's wrong, we'll all just fly over to the Netherlands and have a little "chat" with her about supplying us with innacurate data. Then we'll party.

(jeffa) Re: pattern matching with heuristics
by jeffa (Bishop) on Jul 07, 2000 at 17:03 UTC
    . . . if you don't like the results of String::Approx, try Text::Soundex - it's slow as well, unfortunately.
      Text::Soundex is good for finding "sound-alike" matches, but only for English words. If this is the kind of think you're looking for, you may also want to consider Text::Metaphone. It provides similar functionality to Text::Soundex (reducing a word to a simplified "sound description") but is more "phonetically" advanced and can handle phrases, not just single words. Also, neither of these two as slow at String::Approx, as String::Approx has to compute the edit distance between two strings and both Soundex and Metaphone just reduce a string to a "sound description" of it. Of course, both Metaphone and Soundex are of no use to you if your data is not English text.
Re: pattern matching with heuristics
by Anonymous Monk on Jul 07, 2000 at 18:11 UTC
    Well, it's not strictly Perl, but it may be usefull anyway. If you are using a file to store these titles over time and that file is in a format that you can use one of the DBI modules to search against, you could craft a SQL query to do near matches. SQL offers the LIKE select type, which I have used to get approximate information in the past. I have never used String::Approx, so I can't offer you a speed comparision, but using the SQL way was plenty fast enough to be used in a web search without noticed latency. http://w3.one.net/~jhoffman/sqltut.htm#Using_LIKE