comment on

My approach would be:

Find the most common value for each column, $mode[$col].
Count how many times that value occurs in its column, $freq[$col].
Assign each column the value, $entropy[$col]= $rows/$freq[$col].
Sort @entropy (descending).
Multiply the largest values of @entropy until you get to a value close to $rows.
Build a 'key' from the columns that contributed the entropy values that produced the near-$rows product.
Compute $mode and $freq for that key.
If $freq is 1, then drop the last column and try again, looking for a more minimal solution.
If $freq is more than 1, add the column that contributes the next-smaller value from @entropy.

After you've found the minimal solution using the initial ordering, you can look for "near by" solutions that use fewer columns by randomly removing a column (weighted somewhat toward the selected columns providing the least entropy) and, when required, randomly adding a new column (weighted somewhat toward the unselected columns providing the most entropy). Continue this until your solution seems adequate for the time you've spent searching for it.

You will likely end up building a list of several equally or nearly equally minimal solutions and working through that list trying to adjust each one so that it becomes more minimal. This is at least similar to an asexual "genetic" algorithm (I wouldn't bother "breeding" two solutions together). You'd build a population of "more successful" solutions were each generation adds several modified copies of each solution and culls the least successful of the new population (the most successful solutions get the most modified copies of themselves added). In that scenario, it would be worth caching previously culled potential solutions so you don't waste time recomputing their fitness.

If you're data is interestingly correlated, then you might want to use a different formula for $entropy. For example, if you have a couple of columns that have a pattern of values like "none,1", "none,2", "none,3",... "none,100", "a,0", "b,0",... "z,0"... then each column might have a entropy of only 2 but together they actually form a unique key.

In such a case, you should probably modify the computation of $entropy to take into account the number of unique values in a column, not just the frequency of the least unique value. The minimum entropy would be $rows/$freq and the maximum entropy would be $uniqs. I'd probably first try combining the two using an geometric mean: $entropy= sqrt($uniqs*$rows/$freq).

Although I describe "find the mode" and "find the frequency of the mode" as two separate steps, it is more likely that you'd perform a single step that would then give you $freq and $uniqs for a column ($mode is only useful in explaining the approach and isn't actually useful writing the code).

I also say "sort @entropy" but clearly you'd want to "sort { $entropy[$b] <=> $entropy[$a] } 0..$#entropy".

Update: There may also be cases where it is helpful to compute the actual combined "entropy" of pairs or small groups of columns and then add more successful groups together. However, the way I outlined above is going to be at least close to the fastest way to come up with a solution that has decent odds of not being among the worst of the solutions. Looking at pairs after that might allow refining to progress more quickly.

- tye

In reply to Re: Finding Keys from data (where to start) by tye
in thread Finding Keys from data by aartist

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.