in reply to Match similar text

Take a look at this node posted yesterday, though Text::Levenshtein is usually the standard answer.

I would do something like the following:

  • Set a maximum threshold, so that if even the closest match exceeds it, the record is set aside for human review
  • Iterate over each state, calculate the similarity distance, and select the state with the shortest distance
  • Set aside for human review any record whose two closest states are nearly tied, perhaps differing by a distance of only 1
  • Write a log for changes until you feel confident/comfortable it is doing the right thing
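The steps above could be sketched roughly as follows. This is only an illustration in Python rather than Perl (where Text::Levenshtein would supply the distance), and the state list, the `MAX_DISTANCE` and `MIN_MARGIN` values, and the `match_state` name are all assumptions made for the example, not anything from the original question.

```python
def levenshtein(a, b):
    """Edit distance via the usual dynamic-programming table, kept one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute (free if equal)
        prev = cur
    return prev[-1]

# Illustrative subset only; a real run would list every state.
STATES = ["Alabama", "Alaska", "Arizona", "Arkansas", "Kansas", "Texas"]

MAX_DISTANCE = 3  # assumed threshold: anything further away goes to a human
MIN_MARGIN = 2    # assumed: the best match must beat the runner-up by at least this

def match_state(text, states=STATES):
    """Return (state, note); state is None when a human should look at it."""
    ranked = sorted((levenshtein(text.lower(), s.lower()), s) for s in states)
    (best_d, best), (next_d, _) = ranked[0], ranked[1]
    if best_d > MAX_DISTANCE:
        return None, "no candidate within threshold"
    if next_d - best_d < MIN_MARGIN:
        return None, "ambiguous"
    return best, f"distance {best_d}"

for raw in ["Texs", "Alabma", "Zzzzzzzz"]:
    state, note = match_state(raw)
    # Stand-in for the change log: record every decision for later review.
    print(f"{raw!r} -> {state!r} ({note})")
```

With this subset, "Alabma" is 1 edit from Alabama but only 2 from Alaska, so it is flagged as ambiguous instead of being corrected automatically.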

    Cheers - L~R

    Re: Re: Match similar text
    by exussum0 (Vicar) on Sep 07, 2003 at 00:24 UTC
      In conjunction with that, the person might want to select all the distinct misspelled states and fix each one with a single update. If his DB is big, say 6 million rows, going row by row would mean 6 million selects plus 1 update for every misspelled row, versus 1 select over the 6-million-row table and 1 update for each distinct misspelling.

      Maybe the person already does this, but might as well be obvious :)
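A minimal sketch of the distinct-value approach, using Python's built-in sqlite3 with an in-memory database. The `addresses` table, its `state` column, and the `correct` helper are hypothetical stand-ins; `correct` would be replaced by whatever fuzzy matcher is actually used.

```python
import sqlite3

VALID = {"Texas", "Kansas"}  # illustrative subset of correct spellings

def correct(value):
    """Hypothetical stand-in for the fuzzy matcher; None means 'leave for a human'."""
    return {"Texs": "Texas", "Kanss": "Kansas"}.get(value)

def fix_states(conn):
    """Issue one UPDATE per distinct misspelling; return how many UPDATEs ran."""
    # One pass over the table collects every spelling that occurs.
    distinct = [row[0] for row in conn.execute("SELECT DISTINCT state FROM addresses")]
    updates = 0
    for value in distinct:
        if value in VALID:
            continue
        fixed = correct(value)
        if fixed is None:
            continue
        # A single statement fixes every row sharing this misspelling.
        conn.execute("UPDATE addresses SET state = ? WHERE state = ?", (fixed, value))
        updates += 1
    conn.commit()
    return updates

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addresses (id INTEGER PRIMARY KEY, state TEXT)")
conn.executemany("INSERT INTO addresses (state) VALUES (?)",
                 [("Texas",), ("Texs",), ("Texs",), ("Kansas",), ("Kanss",)])
n_updates = fix_states(conn)
print(n_updates, "updates issued")
```

Here three misspelled rows are repaired with two updates, and the gap widens as the same misspelling repeats across more rows.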
      Play that funky music white boy..