Here is one way to do it:
#! perl -slw use strict; use Algorithm::Combinatorics qw[ variations ]; my %uniq; while( <DATA> ) { chomp; my @fields = split ','; OUTER: for my $k ( 1 .. $#fields ) { my $iter = variations( \@fields, $k ); while( my $v = $iter->next ) { my $key = join ':', @$v; unless( exists $uniq{ $key } ) { $uniq{ $key } = $.; last OUTER; } } } } my @lineorder = sort{ $uniq{ $a } <=> $uniq{ $b } } keys %uniq; print "$uniq{ $_ } :: $_" for @lineorder; __DATA__ 2,6,5,2,1,1,7,9,8,6 5,8,5,0,9,3,8,9,0,2 2,3,1,1,5,2,9,7,8,3 0,1,3,7,6,2,4,3,7,5 4,6,2,8,6,4,1,5,4,3 4,6,7,2,0,9,6,5,0,9 5,6,2,4,3,7,1,9,3,5 2,5,7,1,0,0,0,5,8,5 3,8,1,4,9,2,5,8,1,0 5,2,2,2,0,7,2,8,3,1 7,1,2,6,5,4,0,9,2,5 1,6,3,7,3,8,7,0,7,7 0,0,8,9,9,8,3,3,6,0 0,2,5,3,8,4,1,8,9,4 5,6,9,0,6,4,9,5,0,7 9,0,9,3,2,6,3,2,4,6 3,3,0,4,8,5,7,7,2,4 3,1,3,0,0,3,1,7,3,8 0,6,7,0,8,9,4,8,4,8 0,2,0,3,7,4,6,8,4,5
Output (line number :: uniq key}
c:\test>896650 1 :: 2 2 :: 5 3 :: 3 4 :: 0 5 :: 4 6 :: 6 7 :: 7 8 :: 1 9 :: 8 10 :: 5:2 11 :: 9 12 :: 1:6 13 :: 0:0 14 :: 0:2 15 :: 5:6 16 :: 9:0 17 :: 3:3 18 :: 3:1 19 :: 0:6 20 :: 0:3
But do realise that this is a massively combinatorial problem, and given the size of your data, if there is a high degree of similarity in your lines, it could take a very long time.
And, the results produced are in no way 'optimal'. That is to say, there is no attempt here to produce the set of shortest signatures. To do so would make this an NP hard problem of a massive scale.
By way of example: With 100 fields, there are ~10^157 variations of fields for each line. To find the optimal solution, you'd need to try (something like) 10^157! permutations. That number is so massive as to be impossible to calculate--never mind performing that number of trials--even if you had all the computers and storage in the world.
In reply to Re: Finding Keys from data
by BrowserUk
in thread Finding Keys from data
by aartist
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |