in reply to Fuzzy searching
turnstep's suggestion is excellent. There are two other modules you
may want to check out:
- Text::Soundex - good for determining whether words sound alike
- Text::Metaphone - a better algorithm than Text::Soundex, and good for phrases
Different algorithms may work better depending on exactly which piece(s) of data you want to de-duplicate on. If you're de-duplicating on multiple fields, some sort of hybrid de-duplication algorithm may be best. Here's an example de-duplication scheme I cooked up for a database of people, where they live, and what their income is:
(LAST_NAME the same according to String::Approx) and (INCOME within 5%) and (STATE the same)

In my experience, coming up with a de-duplication scheme for user-entered data is easy. Coming up with a good one is hard and may take weeks or months of tuning.
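The hybrid rule above can be sketched roughly like this. To keep the example free of CPAN dependencies, a plain Levenshtein edit distance stands in for String::Approx, so this is not how String::Approx itself matches. The names edit_distance, probably_same_person, max_edits and income_tol are made up for illustration, and measuring the 5% against the larger of the two incomes is my assumption; the rule as stated doesn't say what the percentage is relative to.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Plain Levenshtein edit distance, case-insensitive.
# ASSUMPTION: a stand-in for String::Approx, not its actual algorithm.
sub edit_distance {
    my ($s, $t) = (lc $_[0], lc $_[1]);
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1,           # deletion
                           $cur[$j - 1] + 1,        # insertion
                           $prev[$j - 1] + $cost);  # substitution
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Hybrid rule: fuzzy LAST_NAME and INCOME within tolerance and same STATE.
# Records are hash refs with LAST_NAME, INCOME and STATE keys.
sub probably_same_person {
    my ($p, $q, %opt) = @_;
    my $max_edits  = $opt{max_edits}  // 1;     # hypothetical default
    my $income_tol = $opt{income_tol} // 0.05;  # "within 5%"

    # Cheapest test first: states must match exactly (ignoring case).
    return 0 unless uc($p->{STATE}) eq uc($q->{STATE});

    # ASSUMPTION: the tolerance is relative to the larger income.
    my $limit = $income_tol * max($p->{INCOME}, $q->{INCOME});
    return 0 unless abs($p->{INCOME} - $q->{INCOME}) <= $limit;

    return edit_distance($p->{LAST_NAME}, $q->{LAST_NAME}) <= $max_edits ? 1 : 0;
}

my $x = { LAST_NAME => 'Smith', INCOME => 50_000, STATE => 'OH' };
my $y = { LAST_NAME => 'Smyth', INCOME => 51_000, STATE => 'oh' };
print probably_same_person($x, $y) ? "duplicate\n" : "distinct\n";
```

Doing the exact STATE comparison first means the more expensive fuzzy name comparison only runs on records that already agree on the cheap fields. The thresholds are the part that takes the tuning.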