http://qs321.pair.com?node_id=998358


in reply to how to generate confidence / weight score for a match?

Assuming you believe your list of good hostnames is relatively complete, the key issue for confidence is how many other believable matches there are in that list. The actual number of characters matched won't tell you much: you could have only one possible match for a hostname beginning with "www.z", but 30 hostnames matching one beginning with "www.all". So even though "www.all" matches more letters than "www.z", you should have much less confidence in any one match on "www.all" than in a match on "www.z".

In terms of assigning a numeric value to the match, a simple algorithm would be 1/possible_clean_names. This works if all possibilities are equally weighted as 1 and all non-possibilities get a weight of 0. If you want to get fancy and weight the possible clean matches, your number would be (weight of selected match)/(sum of weights for all possible matches).
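As a rough sketch of that arithmetic in Perl (the hostnames and the confidence() sub here are made-up illustrations, not anything from your data):

    use strict;
    use warnings;

    # Confidence of one chosen match among all the possible clean-name matches.
    # With every weight equal to 1 this reduces to 1/possible_clean_names.
    sub confidence {
        my ($chosen, %weight_for) = @_;    # clean name => weight
        my $total = 0;
        $total += $_ for values %weight_for;
        return $total ? $weight_for{$chosen} / $total : 0;
    }

    # Four equally weighted candidates beginning with "www.all" => 0.25 each
    my %candidates = map { $_ => 1 }
        qw(www.allrecipes.com www.allmusic.com www.allposters.com www.allaboutjazz.com);
    printf "%.2f\n", confidence('www.allrecipes.com', %candidates);    # prints 0.25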

The design problem then is to decide what constitutes a possible match, i.e. what makes you or Roboticus intuitively say that nothing else could be a likely match. Note that this rule might vary from log file to log file.

It would help if you knew something about the algorithms these older log files used to truncate hostnames. For example, did they use a strict 25-character cutoff? If so, then for those log files you can assume that any logged hostname shorter than 25 characters is an exact match. For bad hostnames that are exactly 25 characters, you could count anything in your clean-name file as a possible match if it matches on the first 25 characters. If there are two possible matches, your confidence would be 50%. If there are four, your confidence would be 25%.
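A minimal sketch of that prefix test, assuming a hypothetical 25-character cutoff and made-up clean names:

    use strict;
    use warnings;

    # Made-up clean names and a 25-character truncated name from a log file.
    my @clean_names = qw(www.verylonghostnameexample.com www.verylonghostnameexample.org);
    my $logged      = 'www.verylonghostnameexamp';    # exactly 25 characters
    my $cutoff      = 25;

    my @possible;
    if (length($logged) < $cutoff) {
        # Shorter than the cutoff, so it was never truncated: require an exact match.
        @possible = grep { $_ eq $logged } @clean_names;
    }
    else {
        # Truncated: any clean name that agrees on the first 25 characters is a candidate.
        @possible = grep { substr($_, 0, $cutoff) eq $logged } @clean_names;
    }

    my $confidence = @possible ? 1 / @possible : 0;
    printf "%d candidate(s), confidence %.2f\n", scalar @possible, $confidence;    # 2 candidates, 0.50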

What about top-level domain names? Did the log file strip off the top-level domain name? If so, for that log file you should ignore failed matches on the top level of the domain name. You would still have an exact match if your clean-name file had only one entry matching on everything but the top-level domain name. If there were two hostnames that matched but with different top-level domain names, then your confidence would drop to 1 in 2, or 50%.
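Something along these lines might work for that case; the clean names and the strip_tld() helper are hypothetical stand-ins for whatever you actually have:

    use strict;
    use warnings;

    # Hypothetical helper: drop the final label (".com", ".org", ...) before comparing.
    sub strip_tld {
        my ($name) = @_;
        $name =~ s/\.[^.]+\z//;
        return $name;
    }

    my @clean_names = qw(www.example.com www.example.org www.other.net);
    my $logged      = 'www.example';    # the log file already stripped the TLD

    my @possible   = grep { strip_tld($_) eq $logged } @clean_names;
    my $confidence = @possible ? 1 / @possible : 0;
    printf "%d candidate(s), confidence %.2f\n", scalar @possible, $confidence;    # 2 candidates, 0.50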

What about typos? Where are these log files getting their truncated hostnames? If from HTTP headers, then typos are unlikely; but if they come from browser request files or email headers, where the sender information can be manipulated by a person rather than auto-generated by a computer, typos might be an issue.

Hopefully, human input is not an issue. But if it is for some of the log files, you might be able to find some statistical data on the web for common typos in hostnames. If not, you'd need to know something about the likely language and keyboard of the user making the mistake. Typos tend to be language dependent because they rely on things like adjacent keys on the keyboard, phonetic rules of the language, and the degree to which the language's orthography matches its pronunciation.

Regardless of how you determine your list of potential typos, you could use it to add any hostname that matches except for typos to your list of possible matches. Since some typos are more probable than others, this might be a case where you would want to weight the possibilities and use a weighted confidence algorithm. Mismatches due to inevitable truncation would get the highest weight, since the logged name could never have been right. Mismatches that could have been entered correctly but perhaps weren't, i.e. typos, would get lower weights.

For instance, you could weight a possibility that mismatched because of a log file's truncation algorithm as 1.00, mismatches due to common typos as 0.50, and mismatches due to very rare typos as 0.10. (Note: the weights for typos are arbitrary and unscientific - I just chose numbers less than one.)
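Putting those arbitrary weights into code, with entirely made-up candidate names and mismatch classifications, might look something like this:

    use strict;
    use warnings;

    # The arbitrary weights from above, keyed by why each candidate mismatched.
    my %weight_for_reason = (
        truncation  => 1.00,    # the log file could never have recorded the full name
        common_typo => 0.50,
        rare_typo   => 0.10,
    );

    # Hypothetical candidates: clean name => how its mismatch was classified.
    my %candidates = (
        'www.example-department.com'  => 'truncation',
        'www.example-departments.com' => 'common_typo',
        'www.exemple-department.com'  => 'rare_typo',
    );

    my $total = 0;
    $total += $weight_for_reason{$_} for values %candidates;

    for my $name (sort keys %candidates) {
        my $weight = $weight_for_reason{ $candidates{$name} };
        printf "%-30s confidence %.2f\n", $name, $weight / $total;
    }
    # Total weight is 1.60, so the truncation candidate scores about 0.62,
    # the common typo about 0.31, and the rare typo about 0.06.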