My approach would be:
After you've found a minimal solution using the initial ordering, you can look for "nearby" solutions that use fewer columns by randomly removing a column (weighted somewhat toward the selected columns providing the least entropy) and, when required, randomly adding a new column (weighted somewhat toward the unselected columns providing the most entropy). Continue this until your solution seems adequate for the time you've spent searching for it.
You will likely end up building a list of several equally or nearly equally minimal solutions and working through that list, trying to adjust each one so that it becomes more minimal. This is at least similar to an asexual "genetic" algorithm (I wouldn't bother "breeding" two solutions together). You'd build a population of "more successful" solutions where each generation adds several modified copies of each solution and culls the least successful of the new population (the most successful solutions get the most modified copies of themselves added). In that scenario, it would be worth caching previously culled potential solutions so you don't waste time recomputing their fitness.
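The remove/re-add refinement step above might be sketched like this. Here `is_key()` is a hypothetical helper (not shown elsewhere in this thread) that returns true when the given column indices uniquely identify every row, and for brevity this picks columns uniformly at random rather than weighting by entropy as suggested:

```perl
use strict;
use warnings;
use List::Util qw( shuffle );

# One refinement round: drop a random selected column; if the result
# is still a key, we've found a smaller solution. Otherwise try
# swapping in one unselected column. Returns the (possibly improved)
# list of selected column indices.
sub refine {
    my ( $is_key, $entropy, @selected ) = @_;
    my %in = map { $_ => 1 } @selected;
    my @unselected = grep { !$in{$_} } 0 .. $#$entropy;

    my @keep = shuffle(@selected);
    my $dropped = pop @keep;            # remove one column at random

    return @keep if $is_key->(@keep);   # smaller key found

    for my $add ( shuffle(@unselected) ) {
        my @try = ( @keep, $add );
        return @try if $is_key->(@try); # equal-sized alternative key
    }
    return @selected;                   # no improvement this round
}
```

Repeating `refine()` from several starting solutions gives you the population of candidates to cull from.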
If your data is interestingly correlated, then you might want to use a different formula for $entropy. For example, if you have a couple of columns that have a pattern of values like "none,1", "none,2", "none,3",... "none,100", "a,0", "b,0",... "z,0"... then each column might have an entropy of only 2 but together they actually form a unique key.
In such a case, you should probably modify the computation of $entropy to take into account the number of unique values in a column, not just the frequency of the most common value. The minimum entropy would be $rows/$freq and the maximum entropy would be $uniqs. I'd probably first try combining the two using a geometric mean: $entropy= sqrt($uniqs*$rows/$freq).
Although I describe "find the mode" and "find the frequency of the mode" as two separate steps, it is more likely that you'd perform a single step that gives you both $freq and $uniqs for a column ($mode is only useful in explaining the approach and isn't actually needed when writing the code).
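That single pass might look like this, using the geometric-mean formula from above (a sketch; the column is just an array of that column's values):

```perl
use strict;
use warnings;

# Compute $freq (count of the most common value), $uniqs (number of
# distinct values), and the combined $entropy for one column in a
# single pass over a tally hash. Note that the mode itself is never
# needed, only its frequency.
sub column_entropy {
    my @column = @_;
    my %count;
    $count{$_}++ for @column;                       # tally each value
    my $uniqs  = keys %count;                       # distinct values
    my ($freq) = sort { $b <=> $a } values %count;  # mode's frequency
    my $rows   = @column;
    return sqrt( $uniqs * $rows / $freq );          # geometric mean of bounds
}

printf "entropy = %.3f\n", column_entropy(qw( none none none a b c ));
# entropy = 2.828  (uniqs=4, rows=6, freq=3, sqrt(8))
```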
I also say "sort @entropy" but clearly you'd want to "sort { $entropy[$b] <=> $entropy[$a] } 0..$#entropy".
Update: There may also be cases where it is helpful to compute the actual combined "entropy" of pairs or small groups of columns and then combine the more successful groups together. However, the approach I outlined above is going to be at least close to the fastest way to come up with a solution that has decent odds of not being among the worst of the solutions. Looking at pairs after that might allow the refining to progress more quickly.
- tye
In reply to Re: Finding Keys from data (where to start)
by tye
in thread Finding Keys from data
by aartist