http://qs321.pair.com?node_id=998346


in reply to how to generate confidence / weight score for a match?

chexmix:

I don't really know the magnitude of your problem, but when I face similar problems, what I do is add a column or two to the table for working on the values. In this case, I might add clean_hostname and confidence.
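
Adding those working columns might be as simple as the following (a sketch only; the exact syntax and column types vary a bit by database, and my_table is just the example table name I'll use below):

alter table my_table add column clean_hostname varchar(255)
alter table my_table add column confidence numeric(5,2)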

Then, I'd figure out how many different hostnames I have to deal with:

select count(distinct hostname) as unique_hostnames, count(*) as record_count from my_table where clean_hostname is null

Next, I'd histogram the hostname frequencies, like:

select hostname, count(*) as cnt from my_table where clean_hostname is null group by hostname order by cnt desc

At this point, I'll recognize some "obviously correct" values, and fix them:

update my_table set clean_hostname = hostname, confidence=100.0 where hostname in ('first.correct.host', 'second.correct.host', <etc>) or hostname like '%.mybiz.com' or hostname like '%.joescompany.net'

Similarly, I'll recognize some "obviously useless" values, and fix them, too:

update my_table set clean_hostname = '*GARBAGE*', confidence=100.0 where hostname in ('first.garbage.host', 'second.garbage.host', <etc>) or hostname like '%.spamzalot.foo'

Then, I'd loop through again to find out how many records are left to go (see the quick check below), and look at the new histogram. By inspection, you should be able to pick out a few more that you're pretty confident in, and fix them accordingly. You'll probably see a few patterns that you can use to correct the others, as well. You might, for example, see a few unique mid-level domains that you can assign: perhaps all hostnames ending in '.q9ix' are most likely part of '.q9ix.houston.com'.

update my_table set clean_hostname = hostname||'.houston.com', confidence=95.0 where hostname like '%.q9ix'
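
To see how much is left after each pass, the quick check might be something like this (again just a sketch against the example my_table):

select sum(case when clean_hostname is null then 1 else 0 end) as remaining, count(*) as total from my_table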

After a few passes, I sometimes find that most of the records are complete, and the data is now good enough that I can call the task 'done'. Other times, I find that there aren't enough patterns to take advantage of, so the data set doesn't reduce in size quickly enough. That's when I pull out perl and start thinking of coding something up to help out.

By looking at the most common items left over (I'll generally look over the top 50 or 100) and recognizing some patterns, you can tear through pretty quickly. As long as the most common item remaining is a significant percentage of the remaining records, I might pursue it manually. If the most common item is too small a percentage, though, it's time to consider other things.
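
To judge whether the biggest remaining items are worth chasing, a variation on the earlier histogram can show each hostname's share of what's left (my_table is still the example name, and the limit syntax may differ on your database):

select hostname, count(*) as cnt,
       100.0 * count(*) / (select count(*) from my_table where clean_hostname is null) as pct_of_remaining
from my_table
where clean_hostname is null
group by hostname
order by cnt desc
limit 100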

...roboticus

When your only tool is a hammer, all problems look like your thumb.