aartist has asked for the wisdom of the Perl Monks concerning the following question:

I have a CSV file that has 100+ columns and 10,000+ lines. I would like to find a unique key for the data. The key is formed by combining multiple columns; for example, columns 1,2,3,12,14 could form a unique key for the given data. How would I find such a combination? I could take all of the columns and that would certainly be unique, but I would like to do it with the minimum number of columns possible.

Thanks.

Replies are listed 'Best First'.
Re: Finding Keys from data
by moritz (Cardinal) on Mar 31, 2011 at 15:46 UTC
Re: Finding Keys from data
by jethro (Monsignor) on Mar 31, 2011 at 15:46 UTC

    Finding an optimal solution may not be possible (without searching the complete space of 2^100 possibilities) but you might try something like this:

    1) For each column, count the number of distinct values in it (easily done with hashes)

    2) Sort that column list, lowest count first

    3) Start with a key consisting of all columns

    4) For each column on that list, try removing it from the key and check whether any two lines now share the same key. If so, re-add the column to your key; otherwise leave it off

    This will give you exactly one solution. If you want to look for better solutions, you can rerun the algorithm but scramble the sorted list of columns somewhat to reach different solutions. You could even use a totally random column order, but that way you will most likely be presented with too many suboptimal solutions. A rough sketch of the procedure follows.
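    A minimal sketch of that elimination loop, assuming plain comma-separated fields with no quoting (real data should go through Text::CSV); the variable names and the "\0" key separator are illustrative choices, not working code from this thread:

    #!/usr/bin/perl
    # Sketch only: greedy backward elimination of key columns.
    use strict;
    use warnings;

    my @rows;
    while (<>) {
        chomp;
        push @rows, [ split /,/, $_, -1 ];
    }
    my @cols = 0 .. $#{ $rows[0] };

    # 1) count distinct values per column
    my %distinct;
    for my $r (@rows) {
        $distinct{$_}{ $r->[$_] }++ for @cols;
    }

    # 2) columns with the fewest distinct values first (best removal candidates)
    my @order = sort { keys %{ $distinct{$a} } <=> keys %{ $distinct{$b} } } @cols;

    # 3) start with every column in the key; 4) try dropping each in turn
    my %inkey = map { $_ => 1 } @cols;
    for my $col (@order) {
        delete $inkey{$col};
        my @keycols = grep { $inkey{$_} } @cols;
        my %seen;
        my $clash = 0;
        for my $r (@rows) {
            if ( $seen{ join "\0", @{$r}[@keycols] }++ ) { $clash = 1; last }
        }
        $inkey{$col} = 1 if $clash;    # removal broke uniqueness: keep the column
    }
    print "key columns: @{[ grep { $inkey{$_} } @cols ]}\n";

    Trying the low-diversity columns first means the least useful columns get tested (and usually dropped) before the diverse ones, which is what the sort in step 2 buys you.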

      jethro:

      I like that approach, but rather than excluding columns one-by-one, I'd include them one-by-one, like this:

      $ cat 896650.pl
      #!/usr/bin/perl
      use strict;
      use warnings;

      # Generate values
      my @cols = ( 0 .. 10 );
      my $rows = 50000;
      my @data = ( [ map { int 100*rand } @cols ] );
      my @gen  = ( map { [ 100*rand, int 100*(1+$_*$_*$_*rand) ] } @cols );
      for my $r (1 .. $rows) {
          for my $c ( @cols ) {
              $data[$r][$c] = int( $data[$r-1][$c] + $gen[$c][0]*rand ) % ($gen[$c][1]);
          }
      }

      # Count distinct column values
      my %H;
      for (@data) {
          my @v = @$_;
          for my $c (@cols) { $H{$c}{$v[$c]}++; }
      }

      # Sort so most diverse comes first
      my @colrank  = sort { $$b[1] <=> $$a[1] }
                     map  { [ $_, scalar keys %{$H{$_}} ] } @cols;
      my @colOrder = map { $$_[0] } @colrank;
      for my $r (@colrank) { print "col $$r[0] has $$r[1] values\n"; }
      print join(", ", @colOrder), "\n";
      print scalar(@data), " rows to process.\n";

      my @tmpdata = @data;
      my @keycols = ( shift @colOrder );
      while (keys %H < @tmpdata) {
          print scalar(@tmpdata), " rows to process.\n";
          %H = ();
          for (@tmpdata) {
              my $key = join(".", @$_[@keycols]);
              $H{$key}++;
          }
          print "cols(", join(",",@keycols), ") give ", scalar(keys %H), " rows.\n";
          push @keycols, shift @colOrder;
      }

      $ perl 896650.pl
      col 10 has 37754 values
      col 9 has 29806 values
      col 8 has 29772 values
      col 7 has 22835 values
      col 6 has 16081 values
      col 5 has 10136 values
      col 3 has 2726 values
      col 4 has 559 values
      col 2 has 216 values
      col 1 has 176 values
      col 0 has 1 values
      10, 9, 8, 7, 6, 5, 3, 4, 2, 1, 0
      50001 rows to process.
      50001 rows to process.
      cols(10) give 37754 rows.
      50001 rows to process.
      cols(10,9) give 49996 rows.
      50001 rows to process.
      cols(10,9,8) give 50001 rows.

      However, I'd wonder what the utility of it would be...

      I was thinking that one possible improvement of my code would be to remove the "duplicate" rows after determining each key, then redo the column count so you could get the best remaining key from the list. It certainly wouldn't be the optimal solution, but I'd bet it gets pretty close.

      ...roboticus

      When your only tool is a hammer, all problems look like your thumb.

        My reason for doing it backwards was that I thought checking whether a column could be removed is easier than checking whether a column should be added: in the first case you only have to check whether any key is now duplicated (easy to do with a hash), while in the second case you have to check whether adding the column actually improves anything.

        That is a real problem with your solution as it stands. Say two columns contain exactly the same values, and those values are diverse enough that both columns sit high on the @colOrder list: both will be added, even though only one of them improves the key quality and the other is completely useless.

        You also talk about an improvement from removing "duplicate" rows. Did you mean "singular" rows? I.e. a row that is already singled out/identified completely by the key is no longer important for finding relevant key columns and could be removed. That is one possible way to check whether adding a column improves things.

        One factor that should also be taken into account, if you want to decide whether adding or removing is better, is how many columns the solution is likely to need. If the data is nearly as random as BrowserUk's test data, the solution will obviously need only a few columns. But if many columns are linked to each other and have only a few distinct values, the key needs many columns.

        Though the more columns the key needs, the worse it is as a key ;-). A key that needs 50 of the 100 columns already uses half the space of the database itself.

Re: Finding Keys from data
by BrowserUk (Patriarch) on Mar 31, 2011 at 16:21 UTC

    Here is one way to do it:

    #! perl -slw
    use strict;

    use Algorithm::Combinatorics qw[ variations ];

    my %uniq;
    while( <DATA> ) {
        chomp;
        my @fields = split ',';
        OUTER: for my $k ( 1 .. $#fields ) {
            my $iter = variations( \@fields, $k );
            while( my $v = $iter->next ) {
                my $key = join ':', @$v;
                unless( exists $uniq{ $key } ) {
                    $uniq{ $key } = $.;
                    last OUTER;
                }
            }
        }
    }

    my @lineorder = sort{ $uniq{ $a } <=> $uniq{ $b } } keys %uniq;
    print "$uniq{ $_ } :: $_" for @lineorder;

    __DATA__
    2,6,5,2,1,1,7,9,8,6
    5,8,5,0,9,3,8,9,0,2
    2,3,1,1,5,2,9,7,8,3
    0,1,3,7,6,2,4,3,7,5
    4,6,2,8,6,4,1,5,4,3
    4,6,7,2,0,9,6,5,0,9
    5,6,2,4,3,7,1,9,3,5
    2,5,7,1,0,0,0,5,8,5
    3,8,1,4,9,2,5,8,1,0
    5,2,2,2,0,7,2,8,3,1
    7,1,2,6,5,4,0,9,2,5
    1,6,3,7,3,8,7,0,7,7
    0,0,8,9,9,8,3,3,6,0
    0,2,5,3,8,4,1,8,9,4
    5,6,9,0,6,4,9,5,0,7
    9,0,9,3,2,6,3,2,4,6
    3,3,0,4,8,5,7,7,2,4
    3,1,3,0,0,3,1,7,3,8
    0,6,7,0,8,9,4,8,4,8
    0,2,0,3,7,4,6,8,4,5

    Output (line number :: unique key):

    c:\test>896650
    1 :: 2
    2 :: 5
    3 :: 3
    4 :: 0
    5 :: 4
    6 :: 6
    7 :: 7
    8 :: 1
    9 :: 8
    10 :: 5:2
    11 :: 9
    12 :: 1:6
    13 :: 0:0
    14 :: 0:2
    15 :: 5:6
    16 :: 9:0
    17 :: 3:3
    18 :: 3:1
    19 :: 0:6
    20 :: 0:3

    But do realise that this is a massively combinatorial problem, and given the size of your data, if there is a high degree of similarity in your lines, it could take a very long time.

    And the results produced are in no way 'optimal'. That is to say, there is no attempt here to produce the set of shortest signatures. To do so would make this an NP-hard problem of massive scale.

    By way of example: With 100 fields, there are ~10^157 variations of fields for each line. To find the optimal solution, you'd need to try (something like) 10^157! permutations. That number is so massive as to be impossible to calculate--never mind performing that number of trials--even if you had all the computers and storage in the world.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Why 10^157? A column is either in or not in the key. That would mean 2^100 possible solutions to try.

      Or are you referring only to the worst case of your algorithm? It seems that using permutations/variations might lead to a lot of overhead in the search: if the combination 2:4 has already been tested, for example, testing 4:2 is not necessary anymore.

        Why 10^157? A column is either in or not in the key. That would mean 2^100 possible solutions to try.

        For a unique key 'ab' is different to 'ba'.

        So it should be 100! + 99! + 98! + 97! ..., which I rounded to ~10^157.

        To be honest, given the scale of the numbers, it makes little difference.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Finding Keys from data
by BrowserUk (Patriarch) on Mar 31, 2011 at 18:56 UTC

    Using jethro/tye's interpretation of your description:

    #! perl -slw
    use strict;
    use Data::Dump qw[ pp ];
    use Algorithm::Combinatorics qw[ combinations ];

    my @lines = map [ do{ chomp; split ',' } ], <DATA>;
    my $w = $#{ $lines[0] };

    my @keyFields;
    WIDTH: for my $k ( 1 .. $w ) {
        my $iter = combinations( [ 0 .. $w ], $k );
        COMB: while( my $c = $iter->next ) {
            my %uniq;
            for my $a ( @lines ) {
                next COMB if exists $uniq{ join ',', @{ $a }[ @$c ] };
                $uniq{ join ',', @{ $a }[ @$c ] } = 1;
            }
            @keyFields = @$c;
            last WIDTH;
        }
    }

    print "All lines are unique using a combination of fields:[ @keyFields ]";

    __DATA__
    2,6,5,2,1,1,7,9,8,6
    5,8,5,0,9,3,8,9,0,2
    2,3,1,1,5,2,9,7,8,3
    0,1,3,7,6,2,4,3,7,5
    4,6,2,8,6,4,1,5,4,3
    4,6,7,2,0,9,6,5,0,9
    5,6,2,4,3,7,1,9,3,5
    2,5,7,1,0,0,0,5,8,5
    3,8,1,4,9,2,5,8,1,0
    5,2,2,2,0,7,2,8,3,1
    7,1,2,6,5,4,0,9,2,5
    1,6,3,7,3,8,7,0,7,7
    0,0,8,9,9,8,3,3,6,0
    0,2,5,3,8,4,1,8,9,4
    5,6,9,0,6,4,9,5,0,7
    9,0,9,3,2,6,3,2,4,6
    3,3,0,4,8,5,7,7,2,4
    3,1,3,0,0,3,1,7,3,8
    0,6,7,0,8,9,4,8,4,8
    0,2,0,3,7,4,6,8,4,5

    Output:

    c:\test>896650-2
    All lines are unique using a combination of fields:[ 3 6 ]

    On a randomly generated dataset of the scale you've suggested, it takes around 30 seconds to complete:

    [19:42:44.71] c:\test>head -2 896650.dat
    357,67,815,493,516,302,810,899,672,91,542,795,527,429,217,502,811,554,345,312,689,336,710,194,778,869,711,413,402,313,960,351,264,511,558,301,184,414,955,528,44,839,786,363,629,599,611,623,488,534,948,140,489,474,511,662,217,665,58,930,879,683,764,203,863,384,509,5,612,903,0,382,969,240,130,460,652,474,478,562,7,117,360,688,702,657,329,626,521,808,547,477,903,510,913,883,398,201,375,729
    675,393,313,605,265,95,415,813,667,95,945,188,2,646,803,534,842,589,299,934,395,429,34,733,684,412,799,463,896,130,896,505,530,669,793,750,527,865,514,55,25,659,668,780,892,119,532,163,707,132,85,841,327,314,408,284,714,166,344,425,165,998,843,272,612,671,873,772,134,665,331,726,356,522,838,22,875,191,835,965,373,304,885,7,586,908,588,623,764,702,4,656,794,289,18,872,639,495,513,35

    [19:54:27.66] c:\test>wc -l 896650.dat
    10000 896650.dat

    [19:54:40.64] c:\test>896650-2 896650.dat
    All lines are unique using a combination of fields:[ 0 1 2 ]

    [19:55:10.04] c:\test>

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Finding Keys from data
by wind (Priest) on Mar 31, 2011 at 15:38 UTC

    You could always just use a surrogate key like the line number ($.)? That would be unique for yours or any given CSV file. We aren't really going to be able to advise you on how to refer to your data more specifically without knowing anything about it.

    One small note: be sure to check out Text::CSV and Text::CSV_XS for processing CSV files.
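    For instance, a minimal read loop with Text::CSV_XS (the file name, and the column indices taken from the question, are illustrative only):

    use strict;
    use warnings;
    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
    open my $fh, '<', 'data.csv' or die "data.csv: $!";

    my @rows;
    while ( my $row = $csv->getline($fh) ) {
        push @rows, $row;    # each row is an array ref of fields
    }
    close $fh;

    # e.g. test whether columns 1,2,3,12,14 form a unique key
    my %seen;
    my $dups = grep { $seen{ join "\0", @{$_}[1,2,3,12,14] }++ } @rows;
    print $dups ? "$dups duplicate rows under that key\n" : "unique key\n";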

      Sorry, I cannot use additional columns such as line numbers; that will not serve the purpose. I have to find the keys from the existing data and then use those keys to compare against some other data, which has additional columns and rows.
Re: Finding Keys from data (where to start)
by tye (Sage) on Mar 31, 2011 at 17:36 UTC

    My approach would be:

    1. Find the most common value for each column, $mode[$col].
    2. Count how many times that value occurs in its column, $freq[$col].
    3. Assign each column the value, $entropy[$col]= $rows/$freq[$col].
    4. Sort @entropy (descending).
    5. Multiply the largest values of @entropy until you get to a value close to $rows.
    6. Build a 'key' from the columns that contributed the entropy values that produced the near-$rows product.
    7. Compute $mode and $freq for that key.
    8. If $freq is 1, then drop the last column and try again, looking for a more minimal solution.
    9. If $freq is more than 1, add the column that contributes the next-smaller value from @entropy.

    After you've found the minimal solution using the initial ordering, you can look for "near by" solutions that use fewer columns by randomly removing a column (weighted somewhat toward the selected columns providing the least entropy) and, when required, randomly adding a new column (weighted somewhat toward the unselected columns providing the most entropy). Continue this until your solution seems adequate for the time you've spent searching for it.

    You will likely end up building a list of several equally or nearly equally minimal solutions and working through that list, trying to adjust each one so that it becomes more minimal. This is at least similar to an asexual "genetic" algorithm (I wouldn't bother "breeding" two solutions together). You'd build a population of "more successful" solutions where each generation adds several modified copies of each solution and culls the least successful of the new population (the most successful solutions get the most modified copies of themselves added). In that scenario, it would be worth caching previously culled candidate solutions so you don't waste time recomputing their fitness.

    If your data is interestingly correlated, then you might want to use a different formula for $entropy. For example, if you have a couple of columns with a pattern of values like "none,1", "none,2", "none,3", ... "none,100", "a,0", "b,0", ... "z,0", then each column might have an entropy of only 2, yet together they actually form a unique key.

    In such a case, you should probably modify the computation of $entropy to take into account the number of unique values in a column, not just the frequency of the most common value. The minimum entropy would be $rows/$freq and the maximum entropy would be $uniqs. I'd probably first try combining the two using a geometric mean: $entropy = sqrt($uniqs*$rows/$freq).

    Although I describe "find the mode" and "find the frequency of the mode" as two separate steps, it is more likely that you'd perform a single pass that gives you both $freq and $uniqs for a column ($mode is only useful for explaining the approach and isn't actually needed when writing the code).

    I also say "sort @entropy" but clearly you'd want to "sort { $entropy[$b] <=> $entropy[$a] } 0..$#entropy".
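    Putting steps 1 through 7 (plus 9) together with the geometric-mean entropy above, a rough, untested sketch might look like the following; the find_key name, the rows-as-array-of-arrayrefs layout, and the "\0" key separator are illustrative choices rather than part of the description (step 8's backtracking and the randomized refinement are left out):

    use strict;
    use warnings;

    # $rows_ref is an array ref of array refs, one per CSV line.
    sub find_key {
        my ($rows_ref) = @_;
        my @rows  = @$rows_ref;
        my $nrows = @rows;
        my @cols  = 0 .. $#{ $rows[0] };

        # steps 1-3: mode frequency, distinct count, and entropy per column
        my @entropy;
        for my $c (@cols) {
            my %count;
            $count{ $_->[$c] }++ for @rows;
            my $freq  = ( sort { $b <=> $a } values %count )[0];
            my $uniqs = keys %count;
            $entropy[$c] = sqrt( $uniqs * $nrows / $freq );
        }

        # step 4: columns by descending entropy
        my @order = sort { $entropy[$b] <=> $entropy[$a] } @cols;

        # steps 5-6: take columns until the entropy product reaches ~$nrows
        my @key;
        my $product = 1;
        while ( @order and $product < $nrows ) {
            my $c = shift @order;
            push @key, $c;
            $product *= $entropy[$c];
        }

        # steps 7 and 9: grow the key until it really is unique over the data
        while (1) {
            my %seen;
            my $dups = grep { $seen{ join "\0", @{$_}[@key] }++ } @rows;
            return @key unless $dups;
            die "no unique key exists\n" unless @order;
            push @key, shift @order;
        }
    }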

    Update: There may also be cases where it is helpful to compute the actual combined "entropy" of pairs or small groups of columns and then add more successful groups together. However, the way I outlined above is going to be at least close to the fastest way to come up with a solution that has decent odds of not being among the worst of the solutions. Looking at pairs after that might allow refining to progress more quickly.

    - tye