Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

chexmix:

I don't really know the magnitude of your problem, but when I face similar problems, what I do is add a column or two to the database for working on the values. For this case, I might add clean_hostname, and confidence.

Then, I'd figure out how many different hostnames I have to deal with:

select count(distinct hostname) as unique_hostnames, count(*) as record_count from my_table where clean_hostname is null

Next, I'd histogram the hostname frequencies, like:

select hostname, count(*) cnt from the_table where clean_hostname is null group by hostname order by count(*) desc

At this point, I'll recognize some "obviously correct" values, and fix them:

update my_table set clean_hostname = hostname, confidence=100.0 where hostname in ('first.correct.host', 'second.correct.host', <etc>) or hostname like '%.mybiz.com' or hostname like '%.joescompany.net'

Similarly, I'll recognize some "obviously useless" values, and fix them, too:

update my_table set clean_hostname = '*GARBAGE*', confidence=100.0 where hostname in ('first.correct.host', 'second.correct.host', <etc>) or hostname like '%.spamzalot.foo'

Then, I'd loop through again and find out how many records are left to go, and look at the new histogram. By inspection, you should be able to pick out a few more that you're pretty confident in, and fix them accordingly. You'll probably see a few patterns that you can use to correct the others, as well. For example, you might see a few unique mid-level domains that you can assign. For example, perhaps all hostnames ending like '.q9ix' are most likely part of '.q9ix.houston.com'.

update my_table set clean_hostname = hostname||'houston.com', confiden +ce=95.0 where hostname like '%.q9ix'

After a few passes, I sometimes find that most of the records are complete, and the data is now good enough that I can call the task 'done'. Other times, I find that there aren't enough patterns to take advantage of, so the data set doesn't reduce in size quickly enough. That's when I pull out perl and start thinking of coding something up to help out.

By looking at the most common items left over (I'll generally look over the top 50 or 100) and recognizing some patterns, you can tear through pretty quickly. As long as the most common item remaining is a significant percentage of the remaining records, I might persue it manually. If the most common item is too small a percentage, though, it's time to consider other things.

...roboticus

When your only tool is a hammer, all problems look like your thumb.


In reply to Re: how to generate confidence / weight score for a match? by roboticus
in thread how to generate confidence / weight score for a match? by chexmix

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chanting in the Monastery: (3)
As of 2024-03-29 01:37 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found