in reply to Finding Keys from data

Here is one way to do it:

#! perl -slw use strict; use Algorithm::Combinatorics qw[ variations ]; my %uniq; while( <DATA> ) { chomp; my @fields = split ','; OUTER: for my $k ( 1 .. $#fields ) { my $iter = variations( \@fields, $k ); while( my $v = $iter->next ) { my $key = join ':', @$v; unless( exists $uniq{ $key } ) { $uniq{ $key } = $.; last OUTER; } } } } my @lineorder = sort{ $uniq{ $a } <=> $uniq{ $b } } keys %uniq; print "$uniq{ $_ } :: $_" for @lineorder; __DATA__ 2,6,5,2,1,1,7,9,8,6 5,8,5,0,9,3,8,9,0,2 2,3,1,1,5,2,9,7,8,3 0,1,3,7,6,2,4,3,7,5 4,6,2,8,6,4,1,5,4,3 4,6,7,2,0,9,6,5,0,9 5,6,2,4,3,7,1,9,3,5 2,5,7,1,0,0,0,5,8,5 3,8,1,4,9,2,5,8,1,0 5,2,2,2,0,7,2,8,3,1 7,1,2,6,5,4,0,9,2,5 1,6,3,7,3,8,7,0,7,7 0,0,8,9,9,8,3,3,6,0 0,2,5,3,8,4,1,8,9,4 5,6,9,0,6,4,9,5,0,7 9,0,9,3,2,6,3,2,4,6 3,3,0,4,8,5,7,7,2,4 3,1,3,0,0,3,1,7,3,8 0,6,7,0,8,9,4,8,4,8 0,2,0,3,7,4,6,8,4,5

Output (line number :: uniq key}

c:\test>896650 1 :: 2 2 :: 5 3 :: 3 4 :: 0 5 :: 4 6 :: 6 7 :: 7 8 :: 1 9 :: 8 10 :: 5:2 11 :: 9 12 :: 1:6 13 :: 0:0 14 :: 0:2 15 :: 5:6 16 :: 9:0 17 :: 3:3 18 :: 3:1 19 :: 0:6 20 :: 0:3

But do realise that this is a massively combinatorial problem, and given the size of your data, if there is a high degree of similarity in your lines, it could take a very long time.

And, the results produced are in no way 'optimal'. That is to say, there is no attempt here to produce the set of shortest signatures. To do so would make this an NP hard problem of a massive scale.

By way of example: With 100 fields, there are ~10^157 variations of fields for each line. To find the optimal solution, you'd need to try (something like) 10^157! permutations. That number is so massive as to be impossible to calculate--never mind performing that number of trials--even if you had all the computers and storage in the world.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^2: Finding Keys from data
by jethro (Monsignor) on Mar 31, 2011 at 17:02 UTC

    Why 10^257? A column is either in or not in the key. That would mean 2^100 possible solutions to try

    Or are you refering only to the worst case of your algorithm? Seems using permutation/variation might lead to much overhead in the search. If the combination 2:4 were already tested for example, testing 4:2 is not necessary anymore

      Why 10^257? A column is either in or not in the key. That would mean 2^100 possible solutions to try

      For a unique key 'ab' is different to 'ba'.

      So, it should be 100! + 99! + 98! + 97! ... so I rounded up 10^157.

      To be honest, given the scale of the numbers, it makes little difference.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        I believe jethro is thinking about the problem actually proposed while BrowserUk is talking about a different problem. Of course, the root node is not completely clear on this point, so it could be that BrowserUk is correct and several others who have replied are the ones who are wrong.

        BrowserUk is picking a (potentially) different set of columns for each row while I believe a single list of columns is desired such that this same list of columns (in the same order) provide a unique key for each row.

        With a single list of columns and a useful delimiter between values, then the order no longer matters and only 2**100 potential solutions exist. Of course, you can also envision data where there is no completely effective delimiter (unless you also provide for escaping parts of values when the key is constructed). I choose to assume either an easy delimiter choice or building the key with potential escaping such as constructing the key using standard CSV format.

        - tye