aartist has asked for the wisdom of the Perl Monks concerning the following question:
Thanks.
Re: Finding Keys from data
by moritz (Cardinal) on Mar 31, 2011 at 15:46 UTC
If this is for managing database-like data, don't do it. Add an artificial primary key. If you really need it, and your XY Problem has no better solution, you need brute force. If you want a faster but approximate solution, read up on greedy algorithms.
Re: Finding Keys from data
by jethro (Monsignor) on Mar 31, 2011 at 15:46 UTC
Finding an optimal solution may not be possible (without searching the complete space of 2^100 possibilities), but you might try something like this:

1) For each column, count the number of different values in it (easily done with hashes).
2) Sort that column list, lowest count first.
3) Start with a key consisting of all columns.
4) For each column on that list, try to remove it from the key and check whether two lines now share the same key. If yes, re-add the column to your key; otherwise leave it off.

This will give you exactly one solution. If you want to look for better solutions, you might rerun the algorithm but scramble the sorted list of columns somewhat to reach different solutions. You could even use a totally random column list, but that way you will most likely be presented with too many suboptimal solutions.
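A minimal sketch of those four steps (assuming the CSV has already been read into @rows as an array of array-refs, e.g. via Text::CSV_XS):

```perl
use strict;
use warnings;

# Sketch of the remove-one-column-at-a-time idea described above.
# @rows is assumed to hold the parsed CSV lines as array-refs.
sub find_key {
    my @rows  = @_;
    my $ncols = @{ $rows[0] };

    # 1) count the distinct values in each column
    my @distinct;
    for my $c ( 0 .. $ncols - 1 ) {
        my %seen;
        $seen{ $_->[$c] }++ for @rows;
        $distinct[$c] = keys %seen;
    }

    # 2) try to drop the least selective columns first
    my @order = sort { $distinct[$a] <=> $distinct[$b] } 0 .. $ncols - 1;

    # 3) start with every column in the key
    my %in_key = map { $_ => 1 } 0 .. $ncols - 1;

    # 4) drop a column only if the remaining columns still separate all rows
    for my $c (@order) {
        delete $in_key{$c};
        my @cols = sort { $a <=> $b } keys %in_key;
        my ( %seen, $collision );
        for my $row (@rows) {
            if ( $seen{ join "\0", @{$row}[@cols] }++ ) { $collision = 1; last }
        }
        $in_key{$c} = 1 if $collision;    # dropping it broke uniqueness
    }
    return sort { $a <=> $b } keys %in_key;
}
```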
by roboticus (Chancellor) on Apr 01, 2011 at 10:57 UTC
I like that approach, but rather than excluding columns one-by-one, I'd include them one-by-one, like this:
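A sketch of that include-as-you-go idea (illustration only, assuming $rows is an array-ref of array-refs and @colOrder holds the column indices sorted by distinct-value count, most distinct first):

```perl
# Sketch only: grow the key one column at a time, most selective columns
# first, and stop as soon as every row is unique under the chosen columns.
sub grow_key {
    my ( $rows, @colOrder ) = @_;
    my @key;
    for my $c (@colOrder) {
        push @key, $c;
        my %seen;
        my $dups = 0;
        for my $row (@$rows) {
            $dups++ if $seen{ join "\0", @{$row}[@key] }++;
        }
        return @key unless $dups;    # all rows distinct: done
    }
    return;                          # even all listed columns don't suffice
}
```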
However, I'd wonder what the utility of it would be... I was thinking that one possible improvement of my code would be to remove the "duplicate" rows after determining each key, then redo the column count so you could get the best remaining key from the list. It certainly wouldn't be the optimal solution, but I'd bet it gets pretty close.

...roboticus

When your only tool is a hammer, all problems look like your thumb.
by jethro (Monsignor) on Apr 01, 2011 at 11:39 UTC
My reason for doing it backwards was that I thought checking whether a column could be removed was easier than checking whether a column should be added: in the first case you only have to check whether one key is now duplicated (easy to do with a hash), while in the second case you have to check whether adding the column improves anything.

This is a big problem with your solution as it stands. Say two columns contain the same values, and with many different values, so they are both high on the @colOrder list. They will both be added even though only one of them would improve the key quality; the other is completely useless.

You also talk about an improvement by removing "duplicate" rows. Did you mean "singular" rows? I.e. a row that is already singled out/identified completely by the key isn't important anymore for finding relevant keys and could be removed. This is one possible way to check whether a key improves things.

One factor that should also be taken into account, if you want to determine whether adding or removing is better, is how many columns the key will likely need. If the data is nearly as random as BrowserUk's test data, obviously the key will only need a few columns. But if many columns are linked to each other and have only a few different values, the key needs many columns. Though the more columns the key needs, the worse it is as a key ;-). A key that needs 50 of the 100 columns already uses half the space of the database itself.
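A small sketch of that row-pruning check (a hypothetical helper: $rows is an array-ref of array-refs and @key holds the column indices chosen so far):

```perl
# Keep only the rows that still collide under the current key columns;
# rows that are already uniquely identified no longer influence which
# further columns are worth adding.
sub prune_identified_rows {
    my ( $rows, @key ) = @_;
    my %count;
    $count{ join "\0", @{$_}[@key] }++ for @$rows;
    return [ grep { $count{ join "\0", @{$_}[@key] } > 1 } @$rows ];
}
```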
by roboticus (Chancellor) on Apr 01, 2011 at 12:12 UTC
Re: Finding Keys from data
by BrowserUk (Patriarch) on Mar 31, 2011 at 16:21 UTC
Here is one way to do it:
Output (line number :: uniq key)
But do realise that this is a massively combinatorial problem, and given the size of your data, if there is a high degree of similarity in your lines, it could take a very long time. And, the results produced are in no way 'optimal'. That is to say, there is no attempt here to produce the set of shortest signatures. To do so would make this an NP-hard problem of a massive scale.

By way of example: with 100 fields, there are ~10^157 variations of fields for each line. To find the optimal solution, you'd need to try (something like) 10^157! permutations. That number is so massive as to be impossible to calculate--never mind performing that number of trials--even if you had all the computers and storage in the world.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
by jethro (Monsignor) on Mar 31, 2011 at 17:02 UTC
Why 10^257? A column is either in or not in the key. That would mean 2^100 possible solutions to try. Or are you referring only to the worst case of your algorithm?

It seems that using permutations/variations might lead to a lot of overhead in the search. If the combination 2:4 were already tested, for example, testing 4:2 is not necessary anymore.
by BrowserUk (Patriarch) on Mar 31, 2011 at 17:53 UTC
"Why 10^257? A column is either in or not in the key. That would mean 2^100 possible solutions to try"

For a unique key 'ab' is different to 'ba'. So it should be 100! + 99! + 98! + 97! ... so I rounded up to 10^157. To be honest, given the scale of the numbers, it makes little difference.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
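To put rough numbers on the two ways of counting discussed here (unordered column subsets versus ordered selections of columns), a quick Math::BigInt sketch:

```perl
use strict;
use warnings;
use Math::BigInt;

# Unordered column subsets of 100 columns: 2**100
my $subsets = Math::BigInt->new(2)->bpow(100);

# Ordered selections of 1..100 of the 100 columns: sum of 100!/(100-k)!
my $ordered = Math::BigInt->new(0);
for my $k ( 1 .. 100 ) {
    my $p = Math::BigInt->new(1);
    $p->bmul($_) for ( 100 - $k + 1 ) .. 100;
    $ordered->badd($p);
}

printf "unordered subsets:  %d digits\n", length $subsets->bstr();   # ~31 digits (about 1.3e30)
printf "ordered selections: %d digits\n", length $ordered->bstr();   # ~159 digits (about 2.5e158)
```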
by tye (Sage) on Mar 31, 2011 at 18:05 UTC
Re: Finding Keys from data
by BrowserUk (Patriarch) on Mar 31, 2011 at 18:56 UTC
Output:
On a randomly generated dataset of the scale you've suggested, it takes around 30 seconds to complete:
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
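For anyone wanting to experiment at a similar scale, a throwaway test-data generator (the 1,000-row count and the 0..9 value range here are guesses, not figures from the thread):

```perl
use strict;
use warnings;

# Write a CSV of random small integers: $rows lines x $cols columns.
my $rows = 1_000;
my $cols = 100;

open my $out, '>', 'test.csv' or die "test.csv: $!";
for ( 1 .. $rows ) {
    print {$out} join( ',', map { int rand 10 } 1 .. $cols ), "\n";
}
close $out or die "test.csv: $!";
```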
Re: Finding Keys from data
by wind (Priest) on Mar 31, 2011 at 15:38 UTC
Could always just use a foreign key like the line number ($.)? That would be unique for yours or any given CSV file. We aren't really going to be able to advise you on how to refer to your data more specifically without knowing anything about it. One small note: be sure to check out Text::CSV and Text::CSV_XS for processing CSV files.
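A short example of both suggestions together, keying each record by its line number while reading with Text::CSV_XS ('data.csv' is a placeholder filename):

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new( { binary => 1, auto_diag => 1 } );

open my $fh, '<', 'data.csv' or die "data.csv: $!";
my %record;
my $n = 0;
while ( my $row = $csv->getline($fh) ) {
    $record{ ++$n } = $row;    # the record number serves as an artificial key
}
close $fh;

printf "read %d records\n", scalar keys %record;
```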
by aartist (Pilgrim) on Mar 31, 2011 at 15:47 UTC
Re: Finding Keys from data (where to start)
by tye (Sage) on Mar 31, 2011 at 17:36 UTC
My approach would be:

After you've found the minimal solution using the initial ordering, you can look for "nearby" solutions that use fewer columns by randomly removing a column (weighted somewhat toward the selected columns providing the least entropy) and, when required, randomly adding a new column (weighted somewhat toward the unselected columns providing the most entropy). Continue this until your solution seems adequate for the time you've spent searching for it. You will likely end up building a list of several equally or nearly equally minimal solutions and working through that list trying to adjust each one so that it becomes more minimal.

This is at least similar to an asexual "genetic" algorithm (I wouldn't bother "breeding" two solutions together). You'd build a population of "more successful" solutions where each generation adds several modified copies of each solution and culls the least successful of the new population (the most successful solutions get the most modified copies of themselves added). In that scenario, it would be worth caching previously culled potential solutions so you don't waste time recomputing their fitness.

If your data is interestingly correlated, then you might want to use a different formula for $entropy. For example, if you have a couple of columns with a pattern of values like "none,1", "none,2", "none,3",... "none,100", "a,0", "b,0",... "z,0"... then each column might have an entropy of only 2, but together they actually form a unique key. In such a case, you should probably modify the computation of $entropy to take into account the number of unique values in a column, not just the frequency of the least unique value. The minimum entropy would be $rows/$freq and the maximum entropy would be $uniqs. I'd probably first try combining the two using a geometric mean: $entropy = sqrt($uniqs*$rows/$freq).

Although I describe "find the mode" and "find the frequency of the mode" as two separate steps, it is more likely that you'd perform a single step that gives you both $freq and $uniqs for a column ($mode is only useful in explaining the approach and isn't actually needed when writing the code). I also say "sort @entropy", but clearly you'd want to "sort { $entropy[$b] <=> $entropy[$a] } 0..$#entropy".

Update: There may also be cases where it is helpful to compute the actual combined "entropy" of pairs or small groups of columns and then add more successful groups together. However, the way I outlined above is going to be at least close to the fastest way to come up with a solution that has decent odds of not being among the worst of the solutions. Looking at pairs after that might allow refining to progress more quickly.

- tye
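A rough sketch of the column scoring and greedy selection described above (the variable names follow the post; the selection loop is an interpretation of the outlined approach, not tye's code):

```perl
use strict;
use warnings;

# Score each column with tye's formula, sqrt($uniqs * $rows / $freq),
# where $freq is the frequency of the column's most common value and
# $uniqs is its number of distinct values, then add columns to the key
# highest entropy first until every row is unique.
sub entropy_key {
    my @rows  = @_;              # array of array-refs
    my $nrows = @rows;
    my $ncols = @{ $rows[0] };

    my @entropy;
    for my $c ( 0 .. $ncols - 1 ) {
        my %count;
        $count{ $_->[$c] }++ for @rows;
        my $uniqs  = keys %count;
        my ($freq) = sort { $b <=> $a } values %count;    # frequency of the mode
        $entropy[$c] = sqrt( $uniqs * $nrows / $freq );
    }

    my @order = sort { $entropy[$b] <=> $entropy[$a] } 0 .. $#entropy;

    my @key;
    for my $c (@order) {
        push @key, $c;
        my %seen;
        my $collision = 0;
        for my $row (@rows) {
            if ( $seen{ join "\0", @{$row}[@key] }++ ) { $collision = 1; last }
        }
        return @key unless $collision;
    }
    return @key;    # even all columns together may not separate every row
}
```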