Re: how to generate confidence / weight score for a match?

by roboticus (Chancellor)
on Oct 11, 2012 at 00:03 UTC


in reply to how to generate confidence / weight score for a match?

chexmix:

I don't really know the magnitude of your problem, but when I face something similar, what I do is add a column or two to the table for working on the values. In this case, I might add clean_hostname and confidence.
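
Something along these lines would add them (just a sketch; the exact alter table syntax and the column types will depend on your database):

alter table my_table add column clean_hostname varchar(255);
alter table my_table add column confidence numeric(5,2);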

Then, I'd figure out how many different hostnames I have to deal with:

select count(distinct hostname) as unique_hostnames, count(*) as record_count from my_table where clean_hostname is null

Next, I'd histogram the hostname frequencies, like:

select hostname, count(*) as cnt from my_table where clean_hostname is null group by hostname order by count(*) desc

At this point, I'll recognize some "obviously correct" values, and fix them:

update my_table set clean_hostname = hostname, confidence=100.0 where hostname in ('first.correct.host', 'second.correct.host', <etc>) or hostname like '%.mybiz.com' or hostname like '%.joescompany.net'

Similarly, I'll recognize some "obviously useless" values, and fix them, too:

update my_table set clean_hostname = '*GARBAGE*', confidence=100.0 where hostname in ('first.garbage.entry', 'second.garbage.entry', <etc>) or hostname like '%.spamzalot.foo'

Then, I'd loop through again to find out how many records are left to go, and look at the new histogram. By inspection, you should be able to pick out a few more that you're pretty confident in, and fix them accordingly. You'll probably see a few patterns that you can use to correct the others, as well. For example, you might see a few unique mid-level domains that you can assign: perhaps all hostnames ending in '.q9ix' are most likely part of '.q9ix.houston.com'.

update my_table set clean_hostname = hostname||'.houston.com', confidence=95.0 where hostname like '%.q9ix'

After a few passes, I sometimes find that most of the records are complete, and the data is now good enough that I can call the task 'done'. Other times, I find that there aren't enough patterns to take advantage of, so the data set doesn't reduce in size quickly enough. That's when I pull out perl and start thinking of coding something up to help out.

By looking at the most common items left over (I'll generally look over the top 50 or 100) and recognizing some patterns, you can tear through pretty quickly. As long as the most common item remaining is a significant percentage of the remaining records, I might pursue it manually. If the most common item is too small a percentage, though, it's time to consider other approaches.
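
For instance, to see what fraction of the remaining records each of the top entries accounts for, something like this gives the picture (another sketch; swap the limit clause for top or fetch first if your database wants that):

select hostname, count(*) as cnt, 100.0 * count(*) / (select count(*) from my_table where clean_hostname is null) as pct_of_remaining from my_table where clean_hostname is null group by hostname order by count(*) desc limit 100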

...roboticus

When your only tool is a hammer, all problems look like your thumb.
