Metric for confidence of complex match

japhy has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Metric for confidence of complex match by samtregar (Abbot) on Oct 27, 2005 at 21:37 UTC
I'm not sure weighting comes into play That's the direction I'd head, I think. Each field contributes to the likelihood of a match, but probably not equally. If they are equal then you can just add them, right? One way to solve this is with a neural network. You'll need a training set of known records with known-good result values. If your training data is good then your network should be able to assign optimal weights to the various inputs. If you don't have a source of good, diverse training data then don't bother going down this path. -sam	[reply]
Re: Metric for confidence of complex match by BrowserUk (Patriarch) on Oct 27, 2005 at 23:54 UTC
I think i would bit-encode the numbers into a single numeric value, ordering them from most to least significant. Eg. You have values A > B > C in terms of importance, and each could range from 0 .. 3, the the combination of A=2, B=3, C=1 would become b'101101'. Converted back to a single numeric value they can be sorted numerically and the position of the fields will weight the sort. You'll probably have to change the pack/unpack formats (switch the first 'b' for 'B'?), and possible drop one or both of the reverses on non-intel platforms. This assumes 6 values in the range 0 .. 3. #! perl -slw use strict; my @data = map{ [ map{ int rand 4 } 1 .. 6 ] } 1 .. 20; my @sorted = map { $_->[1] } sort { $b->[0] <=> $a->[0] } map { my $n = unpack 'N', pack 'B32', reverse join '', map{ scalar reverse unpack 'b2', pack 'C*',$_ } @$_ ; [ $n, $_ ] } @data; print "@$_" for @sorted; __END__ P:\test>503446.pl 3 3 2 3 3 3 3 3 2 3 1 0 3 2 1 2 2 1 3 1 3 3 1 0 3 0 3 1 2 2 3 0 0 2 1 3 2 2 1 1 2 1 2 1 3 3 3 0 2 0 0 3 0 3 2 0 0 2 1 0 1 3 3 3 2 1 1 3 1 0 1 2 1 2 1 0 1 1 1 0 3 1 2 2 1 0 0 1 2 1 0 1 3 1 2 3 0 1 1 3 3 0 0 1 0 3 3 1 0 0 0 3 3 2 0 0 0 3 1 3 [download] Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l]
Re^2: Metric for confidence of complex match by BrowserUk (Patriarch) on Oct 28, 2005 at 05:34 UTC
A rather simpler and portable version of the above that also contrasts the results with both the sum and product of the same values. It shows how the positional weighting remais true when the sum and product do not: #! perl -slw use strict; use List::Util qw[ sum reduce ]; our $SEED \|\|=1; srand( $SEED ); my @data = map{ [ map{ int rand 4 } 1 .. 6 ] } 1 .. 20; my @sorted = map { $_->[1] } sort { $b->[0] <=> $a->[0] } map { my $n = reduce{ $a << 2 \| $b } 0, @$_; [ $n, $_ ] } @data; print "@$_\tsum:", sum( @$_ ), "\tproduct: ", reduce{ $a *= $b\|\|1 } 1, + @$_ for @sorted; __END__ P:\test>503446.pl -SEED=-999998 3 2 2 3 1 1 sum:12 product: 36 3 2 0 1 3 2 sum:11 product: 36 3 1 3 1 3 0 sum:11 product: 27 3 1 3 0 0 2 sum:9 product: 18 3 0 2 2 3 2 sum:12 product: 72 3 0 1 1 0 0 sum:5 product: 3 2 2 1 0 1 2 sum:8 product: 8 2 1 1 0 3 2 sum:9 product: 12 2 1 0 3 1 1 sum:8 product: 6 2 0 3 0 2 2 sum:9 product: 24 2 0 1 1 2 1 sum:7 product: 4 1 3 2 2 2 0 sum:10 product: 24 1 3 0 1 3 1 sum:9 product: 9 1 2 1 1 0 1 sum:6 product: 2 1 1 3 2 3 0 sum:10 product: 18 1 0 2 0 1 1 sum:5 product: 2 0 3 3 0 1 2 sum:9 product: 18 0 2 3 2 0 0 sum:7 product: 12 0 2 3 0 3 1 sum:9 product: 18 0 1 3 0 3 1 sum:8 product: 9 P:\test>503446.pl -SEED=1000 3 3 0 3 0 0 sum:9 product: 27 3 3 0 2 3 3 sum:14 product: 162 3 1 3 1 3 0 sum:11 product: 27 3 1 3 0 0 2 sum:9 product: 18 3 0 2 2 3 2 sum:12 product: 72 3 0 1 1 0 0 sum:5 product: 3 2 2 1 0 1 2 sum:8 product: 8 2 1 1 0 3 2 sum:9 product: 12 2 1 0 3 1 1 sum:8 product: 6 2 0 3 0 2 2 sum:9 product: 24 2 0 1 1 2 1 sum:7 product: 4 1 3 2 2 2 0 sum:10 product: 24 1 3 0 1 3 1 sum:9 product: 9 1 2 1 1 0 1 sum:6 product: 2 1 1 3 2 3 0 sum:10 product: 18 1 0 2 0 1 1 sum:5 product: 2 0 3 3 0 1 2 sum:9 product: 18 0 2 3 2 0 0 sum:7 product: 12 0 2 3 0 3 1 sum:9 product: 18 0 1 3 0 3 1 sum:8 product: 9 [download] Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l]
Re: Metric for confidence of complex match by GrandFather (Saint) on Oct 27, 2005 at 21:20 UTC
Care to give us a few more candiates and the order you think you would like to see them matched? For example how do the following sets "sort" (assuming the same coding used for XYZ): `(0, 1, 3, 3, 1), (0, 2, 2, 2, 2), (1, 1, 2, 2, 2), (1, 1, 1, 2, 3), (0, 0, 3, 3, 3),` [download] Perl is Huffman encoded by design.	[reply] [d/l]
Re: Metric for confidence of complex match by Util (Priest) on Oct 27, 2005 at 22:17 UTC
The basic value would be the number of bits set. You can add weight to certain bits, or to certain fields, or to both. Terribly under-tested code: use strict; use warnings; my %match = ( 'ABC.com' => { 'XYZ.com' => { company => 1, # reg. only contact => 0, # no match address => 3, # reg. and admin. phone => 2, # admin. only email => 2, # admin. only }, 'BBC.com' => { company => 3, contact => 3, address => 3, phone => 3, email => 3, }, }, 'FOO.com' => { 'BAR.com' => { company => 1, contact => 3, address => 3, phone => 2, email => 1, }, 'BAZ.com' => { company => 3, contact => 2, address => 2, phone => 2, email => 2, }, }, ); # I'm sure there is a faster way to count bits. sub bit_count { my ( $number ) = @_; my $bits = unpack 'b', pack 'C', $number; my $bits_set = ( $bits =~ tr/1// ); return $bits_set; } sub confidence_by_bits { my ( $href ) = @_; my $weight = 0; foreach ( values %{ $href } ) { $weight += bit_count($_); } return $weight; } sub confidence_by_weighted_position { my ( $href ) = @_; my $weight = 0; foreach ( values %{ $href } ) { # Reg - slightly important if ( $_ \| 1 ) { $weight += 1; } # Admin - very important if ( $_ \| 2 ) { $weight += 5; } # Bonus for both if ( $_ == 3 ) { $weight += 20; } } return $weight; } sub confidence_by_weighted_field { my ( $href ) = @_; my $weight = 0; while ( my ( $key, $val ) = each %{ $href } ) { my $bits = bit_count($val); if ( $key eq 'company' ) { $bits = 5; } elsif ( $key eq 'phone' ) { $bits *= 0.8; } $weight += $bits; } return $weight; } print_with_confidence( \%match, \&confidence_by_bits ); print_with_confidence( \%match, \&confidence_by_weighted_position ); print_with_confidence( \%match, \&confidence_by_weighted_field ); sub print_with_confidence { my ( $match_href, $conf_func ) = @_; while ( my ( $domain1, $v1 ) = each %match ) { my @candidates; while ( my ( $domain2, $v2 ) = each %$v1 ) { my $confidence = $conf_func->( $v2 ); push @candidates, [ $domain2, $confidence ]; } @candidates = sort { $b->[1] <=> $a->[1] \|\| $b->[0] cmp $a->[0] } @candidates; print "$domain1\n"; foreach ( @candidates ) { my ( $domain2, $confidence ) = @{ $_ }; printf "\t%7d\t%s\n", $confidence, $domain2; } } print "\n"; } [download] Output: `ABC.com 10 BBC.com 5 XYZ.com FOO.com 7 BAR.com 6 BAZ.com ABC.com 130 BBC.com 50 XYZ.com FOO.com 70 BAR.com 50 BAZ.com ABC.com 17 BBC.com 8 XYZ.com FOO.com 13 BAZ.com 10 BAR.com` [download]	[reply] [d/l] [select]
Re: Metric for confidence of complex match by GrandFather (Saint) on Oct 27, 2005 at 21:50 UTC
The following weights each match by making better field matches much more important than lesser field matches: `use warnings; use strict; my $line = 0; my @data = map {chomp; [split, ++$line, -1]} (<DATA>); @data = sort {&doCmp} @data; print join "\n", map {join " ", @$_} @data; sub doCmp { calcWeight ($a) if -1 == $a->[-1]; calcWeight ($b) if -1 == $b->[-1]; return $a->[-1] <=> $b->[-1]; } sub calcWeight { my $vector = \@{$_[0]}; my $sum = 0; $sum += $vector->[$_] * $vector->[$_] for (0..@$vector-3); $vector->[-1] = $sum; } __DATA__ 0 1 3 3 1 0 2 2 2 2 1 1 2 2 2 1 1 1 2 3 0 0 3 3 3` [download] Prints: `1 1 2 2 2 3 14 0 2 2 2 2 2 16 1 1 1 2 3 4 16 0 1 3 3 1 1 20 0 0 3 3 3 5 27` [download] Note that the last two columns are the original line number and the calculated weighting respectively. Perl is Huffman encoded by design.	[reply] [d/l] [select]
Re: Metric for confidence of complex match by sauoq (Abbot) on Oct 27, 2005 at 22:42 UTC
I'd probably try an additive approach, but it seems any confidence metric based on these data would be pretty arbitrary. If I understand the problem, you could have zeros for all fields of two domains that are actually "linked" or you could have 3s for all fields of two domains that are not linked (proxies for private registration spring to mind.) -sauoq "My two cents aren't worth a dime.";	[reply]
Re^2: Metric for confidence of complex match by japhy (Canon) on Oct 27, 2005 at 23:03 UTC
Well, there has to be at least one non-zero field, or the link wouldn't have been determined. And as for proxies, I've already removed all those sorts of false positives. I did that for several hours yesterday. I have a fun job. Jeff `japhy` Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and `perl` hacker How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart	[reply]