japhy has asked for the wisdom of the Perl Monks concerning the following question:

I have information about two sets of domains (known to be of a certain category, and suspected to be of that category). Each of the suspected ones has its information compared to the known ones. I'm checking registrant and administrative contact details: company name, contact name, address, phone number, and email.

If XYZ.com's registrant's address is the same as ABC.com's registrant's address, then $match{ABC}{XYZ}{address} |= 1. If XYZ's admin's address is the same as ABC.com's admin's address, then $match{ABC}{XYZ}{address} |= 2. So on and so forth. Thus, I have a hash that looks like:

%match = ( 'ABC.com' => { 'XYZ.com' => { company => 1, # reg. only contact => 0, # no match address => 3, # reg. and admin. phone => 2, # admin. only email => 2, # admin. only }, ..., }, ... );
I want to come up with a logical metric that will allow me to sort through ABC.com's suspected linked domains in order of most likely connection to least likely. I don't want to sort merely by the number of non-zero shared info, but I don't want to multiply, and I'm not sure adding makes sense either. I'm not sure weighting comes into play; the fields have all been error checked, so there's no (truly) bogus data in them. (Some bogus data is there, but it's uniform, which is a good thing. Nevermind.)

Can someone enlighten me?


Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart

Replies are listed 'Best First'.
Re: Metric for confidence of complex match
by samtregar (Abbot) on Oct 27, 2005 at 21:37 UTC
    I'm not sure weighting comes into play

    That's the direction I'd head, I think. Each field contributes to the likelihood of a match, but probably not equally. If they are equal then you can just add them, right?

    One way to solve this is with a neural network. You'll need a training set of known records with known-good result values. If your training data is good then your network should be able to assign optimal weights to the various inputs. If you don't have a source of good, diverse training data then don't bother going down this path.

    -sam

Re: Metric for confidence of complex match
by BrowserUk (Patriarch) on Oct 27, 2005 at 23:54 UTC

    I think i would bit-encode the numbers into a single numeric value, ordering them from most to least significant.

    Eg. You have values A > B > C in terms of importance, and each could range from 0 .. 3, the the combination of A=2, B=3, C=1 would become b'101101'. Converted back to a single numeric value they can be sorted numerically and the position of the fields will weight the sort.

    You'll probably have to change the pack/unpack formats (switch the first 'b' for 'B'?), and possible drop one or both of the reverses on non-intel platforms.

    This assumes 6 values in the range 0 .. 3.

    #! perl -slw use strict; my @data = map{ [ map{ int rand 4 } 1 .. 6 ] } 1 .. 20; my @sorted = map { $_->[1] } sort { $b->[0] <=> $a->[0] } map { my $n = unpack 'N', pack 'B32', reverse join '', map{ scalar reverse unpack 'b2', pack 'C*',$_ } @$_ ; [ $n, $_ ] } @data; print "@$_" for @sorted; __END__ P:\test>503446.pl 3 3 2 3 3 3 3 3 2 3 1 0 3 2 1 2 2 1 3 1 3 3 1 0 3 0 3 1 2 2 3 0 0 2 1 3 2 2 1 1 2 1 2 1 3 3 3 0 2 0 0 3 0 3 2 0 0 2 1 0 1 3 3 3 2 1 1 3 1 0 1 2 1 2 1 0 1 1 1 0 3 1 2 2 1 0 0 1 2 1 0 1 3 1 2 3 0 1 1 3 3 0 0 1 0 3 3 1 0 0 0 3 3 2 0 0 0 3 1 3

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      A rather simpler and portable version of the above that also contrasts the results with both the sum and product of the same values. It shows how the positional weighting remais true when the sum and product do not:

      #! perl -slw use strict; use List::Util qw[ sum reduce ]; our $SEED ||=1; srand( $SEED ); my @data = map{ [ map{ int rand 4 } 1 .. 6 ] } 1 .. 20; my @sorted = map { $_->[1] } sort { $b->[0] <=> $a->[0] } map { my $n = reduce{ $a << 2 | $b } 0, @$_; [ $n, $_ ] } @data; print "@$_\tsum:", sum( @$_ ), "\tproduct: ", reduce{ $a *= $b||1 } 1, + @$_ for @sorted; __END__ P:\test>503446.pl -SEED=-999998 3 2 2 3 1 1 sum:12 product: 36 3 2 0 1 3 2 sum:11 product: 36 3 1 3 1 3 0 sum:11 product: 27 3 1 3 0 0 2 sum:9 product: 18 3 0 2 2 3 2 sum:12 product: 72 3 0 1 1 0 0 sum:5 product: 3 2 2 1 0 1 2 sum:8 product: 8 2 1 1 0 3 2 sum:9 product: 12 2 1 0 3 1 1 sum:8 product: 6 2 0 3 0 2 2 sum:9 product: 24 2 0 1 1 2 1 sum:7 product: 4 1 3 2 2 2 0 sum:10 product: 24 1 3 0 1 3 1 sum:9 product: 9 1 2 1 1 0 1 sum:6 product: 2 1 1 3 2 3 0 sum:10 product: 18 1 0 2 0 1 1 sum:5 product: 2 0 3 3 0 1 2 sum:9 product: 18 0 2 3 2 0 0 sum:7 product: 12 0 2 3 0 3 1 sum:9 product: 18 0 1 3 0 3 1 sum:8 product: 9 P:\test>503446.pl -SEED=1000 3 3 0 3 0 0 sum:9 product: 27 3 3 0 2 3 3 sum:14 product: 162 3 1 3 1 3 0 sum:11 product: 27 3 1 3 0 0 2 sum:9 product: 18 3 0 2 2 3 2 sum:12 product: 72 3 0 1 1 0 0 sum:5 product: 3 2 2 1 0 1 2 sum:8 product: 8 2 1 1 0 3 2 sum:9 product: 12 2 1 0 3 1 1 sum:8 product: 6 2 0 3 0 2 2 sum:9 product: 24 2 0 1 1 2 1 sum:7 product: 4 1 3 2 2 2 0 sum:10 product: 24 1 3 0 1 3 1 sum:9 product: 9 1 2 1 1 0 1 sum:6 product: 2 1 1 3 2 3 0 sum:10 product: 18 1 0 2 0 1 1 sum:5 product: 2 0 3 3 0 1 2 sum:9 product: 18 0 2 3 2 0 0 sum:7 product: 12 0 2 3 0 3 1 sum:9 product: 18 0 1 3 0 3 1 sum:8 product: 9

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Metric for confidence of complex match
by GrandFather (Saint) on Oct 27, 2005 at 21:20 UTC

    Care to give us a few more candiates and the order you think you would like to see them matched? For example how do the following sets "sort" (assuming the same coding used for XYZ):

    (0, 1, 3, 3, 1), (0, 2, 2, 2, 2), (1, 1, 2, 2, 2), (1, 1, 1, 2, 3), (0, 0, 3, 3, 3),

    Perl is Huffman encoded by design.
Re: Metric for confidence of complex match
by Util (Priest) on Oct 27, 2005 at 22:17 UTC

    The basic value would be the number of bits set. You can add weight to certain bits, or to certain fields, or to both.
    Terribly under-tested code:

    use strict; use warnings; my %match = ( 'ABC.com' => { 'XYZ.com' => { company => 1, # reg. only contact => 0, # no match address => 3, # reg. and admin. phone => 2, # admin. only email => 2, # admin. only }, 'BBC.com' => { company => 3, contact => 3, address => 3, phone => 3, email => 3, }, }, 'FOO.com' => { 'BAR.com' => { company => 1, contact => 3, address => 3, phone => 2, email => 1, }, 'BAZ.com' => { company => 3, contact => 2, address => 2, phone => 2, email => 2, }, }, ); # I'm sure there is a faster way to count bits. sub bit_count { my ( $number ) = @_; my $bits = unpack 'b*', pack 'C', $number; my $bits_set = ( $bits =~ tr/1// ); return $bits_set; } sub confidence_by_bits { my ( $href ) = @_; my $weight = 0; foreach ( values %{ $href } ) { $weight += bit_count($_); } return $weight; } sub confidence_by_weighted_position { my ( $href ) = @_; my $weight = 0; foreach ( values %{ $href } ) { # Reg - slightly important if ( $_ | 1 ) { $weight += 1; } # Admin - very important if ( $_ | 2 ) { $weight += 5; } # Bonus for both if ( $_ == 3 ) { $weight += 20; } } return $weight; } sub confidence_by_weighted_field { my ( $href ) = @_; my $weight = 0; while ( my ( $key, $val ) = each %{ $href } ) { my $bits = bit_count($val); if ( $key eq 'company' ) { $bits *= 5; } elsif ( $key eq 'phone' ) { $bits *= 0.8; } $weight += $bits; } return $weight; } print_with_confidence( \%match, \&confidence_by_bits ); print_with_confidence( \%match, \&confidence_by_weighted_position ); print_with_confidence( \%match, \&confidence_by_weighted_field ); sub print_with_confidence { my ( $match_href, $conf_func ) = @_; while ( my ( $domain1, $v1 ) = each %match ) { my @candidates; while ( my ( $domain2, $v2 ) = each %$v1 ) { my $confidence = $conf_func->( $v2 ); push @candidates, [ $domain2, $confidence ]; } @candidates = sort { $b->[1] <=> $a->[1] || $b->[0] cmp $a->[0] } @candidates; print "$domain1\n"; foreach ( @candidates ) { my ( $domain2, $confidence ) = @{ $_ }; printf "\t%7d\t%s\n", $confidence, $domain2; } } print "\n"; }
    Output:
    ABC.com 10 BBC.com 5 XYZ.com FOO.com 7 BAR.com 6 BAZ.com ABC.com 130 BBC.com 50 XYZ.com FOO.com 70 BAR.com 50 BAZ.com ABC.com 17 BBC.com 8 XYZ.com FOO.com 13 BAZ.com 10 BAR.com

Re: Metric for confidence of complex match
by GrandFather (Saint) on Oct 27, 2005 at 21:50 UTC

    The following weights each match by making better field matches much more important than lesser field matches:

    use warnings; use strict; my $line = 0; my @data = map {chomp; [split, ++$line, -1]} (<DATA>); @data = sort {&doCmp} @data; print join "\n", map {join " ", @$_} @data; sub doCmp { calcWeight ($a) if -1 == $a->[-1]; calcWeight ($b) if -1 == $b->[-1]; return $a->[-1] <=> $b->[-1]; } sub calcWeight { my $vector = \@{$_[0]}; my $sum = 0; $sum += $vector->[$_] * $vector->[$_] for (0..@$vector-3); $vector->[-1] = $sum; } __DATA__ 0 1 3 3 1 0 2 2 2 2 1 1 2 2 2 1 1 1 2 3 0 0 3 3 3

    Prints:

    1 1 2 2 2 3 14 0 2 2 2 2 2 16 1 1 1 2 3 4 16 0 1 3 3 1 1 20 0 0 3 3 3 5 27

    Note that the last two columns are the original line number and the calculated weighting respectively.


    Perl is Huffman encoded by design.
Re: Metric for confidence of complex match
by sauoq (Abbot) on Oct 27, 2005 at 22:42 UTC

    I'd probably try an additive approach, but it seems any confidence metric based on these data would be pretty arbitrary. If I understand the problem, you could have zeros for all fields of two domains that are actually "linked" or you could have 3s for all fields of two domains that are not linked (proxies for private registration spring to mind.)

    -sauoq
    "My two cents aren't worth a dime.";
    
      Well, there has to be at least one non-zero field, or the link wouldn't have been determined. And as for proxies, I've already removed all those sorts of false positives. I did that for several hours yesterday. I have a fun job.

      Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
      How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart