Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??


I don't really know the magnitude of your problem, but when I face similar problems, what I do is add a column or two to the database for working on the values. For this case, I might add clean_hostname, and confidence.

Then, I'd figure out how many different hostnames I have to deal with:

select count(distinct hostname) as unique_hostnames, count(*) as record_count from my_table where clean_hostname is null

Next, I'd histogram the hostname frequencies, like:

select hostname, count(*) cnt from the_table where clean_hostname is null group by hostname order by count(*) desc

At this point, I'll recognize some "obviously correct" values, and fix them:

update my_table set clean_hostname = hostname, confidence=100.0 where hostname in ('', '', <etc>) or hostname like '' or hostname like ''

Similarly, I'll recognize some "obviously useless" values, and fix them, too:

update my_table set clean_hostname = '*GARBAGE*', confidence=100.0 where hostname in ('', '', <etc>) or hostname like ''

Then, I'd loop through again and find out how many records are left to go, and look at the new histogram. By inspection, you should be able to pick out a few more that you're pretty confident in, and fix them accordingly. You'll probably see a few patterns that you can use to correct the others, as well. For example, you might see a few unique mid-level domains that you can assign. For example, perhaps all hostnames ending like '.q9ix' are most likely part of ''.

update my_table set clean_hostname = hostname||'', confiden +ce=95.0 where hostname like '%.q9ix'

After a few passes, I sometimes find that most of the records are complete, and the data is now good enough that I can call the task 'done'. Other times, I find that there aren't enough patterns to take advantage of, so the data set doesn't reduce in size quickly enough. That's when I pull out perl and start thinking of coding something up to help out.

By looking at the most common items left over (I'll generally look over the top 50 or 100) and recognizing some patterns, you can tear through pretty quickly. As long as the most common item remaining is a significant percentage of the remaining records, I might persue it manually. If the most common item is too small a percentage, though, it's time to consider other things.


When your only tool is a hammer, all problems look like your thumb.

In reply to Re: how to generate confidence / weight score for a match? by roboticus
in thread how to generate confidence / weight score for a match? by chexmix

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others musing on the Monastery: (9)
    As of 2020-11-25 20:49 GMT
    Find Nodes?
      Voting Booth?

      No recent polls found