What is the fastest way to delete duplicates from multi dimensional array ?

geogpx has asked for the wisdom of the Perl Monks concerning the following question:

The output I got was not yet what I expected. So, to debug, I sorted the data and printed it:

@$points = sort { $a->[3] <=> $b->[3] } @$points;
foreach (@$points) {
  print " $_->[3] $_->[0] $_->[1] $_->[2] \n";
}

gives me

 1338020418 33.514447422 9.142337666 16.479736  
 1338020431 33.514425964 9.142852650 16.960449  
 1338020431 33.514425964 9.142852650 16.960449  
 1338020446 33.514318676 9.143496380 16.960449  
 1338020446 33.514318676 9.143496380 16.960449  
 1338020446 33.514318676 9.143496380 16.960449  
 1338020459 33.514211388 9.144140110 16.479736  
 1338020479 33.514125557 9.145019875 14.557007  
 1338020479 33.514125557 9.145019875 14.557007  
 1338020484 33.514104099 9.145234451 14.557007  
 1338020484 33.514104099 9.145234451 14.557007

And yes, there are duplicates that need to be eliminated. I added the following:

my @unique = uniq @$points;

But then my data is gone; only empty fields are printed. What is a good and fast way to delete duplicates from this kind of data?


Replies are listed 'Best First'.
Re: What is the fastest way to delete duplicates from multi dimensional array ? by Corion (Patriarch) on May 29, 2012 at 13:42 UTC
Have you looked at the common way of eliminating duplicates, as shown by `perldoc -q duplicate`, available online in perlfaq4 ("How can I remove duplicate elements from a list or array")? What part of it is slow?
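For reference, the hash-based idiom that perlfaq4 describes looks roughly like the sketch below. It is only a minimal illustration on a flat list of scalars; the `@stamps` array is a made-up name holding a few timestamps copied from the posted output. Applied directly to a list of array references it would not help, because the hash key would then be the reference itself rather than the row contents.

```perl
use strict;
use warnings;

# Hypothetical flat list: timestamps copied from the posted output.
my @stamps = (1338020418, 1338020431, 1338020431, 1338020446, 1338020446);

# Keep an element only the first time it is seen; input order is preserved.
my %seen;
my @unique = grep { !$seen{$_}++ } @stamps;

print "@unique\n";
```

The `%seen` hash counts occurrences, so each lookup is a constant-time hash access and the whole pass is linear in the number of elements.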
Re: What is the fastest way to delete duplicates from multi dimensional array ? by CountZero (Bishop) on May 29, 2012 at 14:40 UTC
Join all fields together using a delimiter that cannot occur in the data (e.g. '|', written '\|' in a split pattern), throw the joined strings into a hash as keys, and then split the keys again on that delimiter. Fast as greased lightning.

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

My blog: Imperial Deltronics
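The join, hash, and split steps described in this reply could be sketched as follows. This is one possible reading of the suggestion, not CountZero's own code: the `$sep` and `%rows` names are invented for the example, the rows are a small subset of the posted data stored as strings so that no trailing zeros are lost, and the final sort is added because hash keys come back in no particular order.

```perl
use strict;
use warnings;

# A few rows from the posted data (lat, lon, elevation, timestamp), kept as strings.
my $points = [
    [qw(33.514425964 9.142852650 16.960449 1338020431)],
    [qw(33.514447422 9.142337666 16.479736 1338020418)],
    [qw(33.514425964 9.142852650 16.960449 1338020431)],
    [qw(33.514318676 9.143496380 16.960449 1338020446)],
    [qw(33.514318676 9.143496380 16.960449 1338020446)],
];

my $sep = '|';    # assumption: this character never occurs inside a field

# Identical rows produce identical keys, so duplicates collapse in the hash.
my %rows;
$rows{ join $sep, @$_ } = 1 for @$points;

# Split each key back into fields (\Q...\E because '|' is a regex
# metacharacter), then restore the order by timestamp.
my @unique = sort { $a->[3] <=> $b->[3] }
             map  { [ split /\Q$sep\E/, $_ ] } keys %rows;

print "@{$_}[3,0,1,2]\n" for @unique;
```

Note that the rows that come out are new arrays of strings, not the original array references, which matters only if other code still holds on to the originals.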
Re: What is the fastest way to delete duplicates from multi dimensional array ? by Tanktalus (Canon) on May 29, 2012 at 19:09 UTC
Side note: instead of your sort call, consider using Sort::Key's nkeysort_inplace:

use Sort::Key qw(nkeysort_inplace);
nkeysort_inplace { $_->[3] } @$points;

Read Sort::Key's docs to find out how to sort on multiple keys if you want to secondarily sort on other values of your points.

The uniq function doesn't work for you because it treats everything as a string. Since you actually have a list of references (to arrays), not a list of simple scalars, this doesn't quite work. You will pretty much have to roll your own here. I'm sure there are ways to cheat, there always are, but it's probably more work than warranted. Something like setting up your points as objects that overload the `q[""]` operator to return whatever you want to determine uniqueness on - that might work. But I'm not sure about that. :-)
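The point about references being compared as strings can be seen in a small experiment. The sketch below assumes the `uniq` from List::Util (available in version 1.45 and later); the original post does not say which module its `uniq` came from, and the variable names are made up for the illustration.

```perl
use strict;
use warnings;
use List::Util qw(uniq);    # assumption: List::Util 1.45 or later

# Two separate arrays with identical contents.
my $p = [qw(33.514425964 9.142852650 16.960449 1338020431)];
my $q = [qw(33.514425964 9.142852650 16.960449 1338020431)];

# Each reference stringifies to something of the form ARRAY(0x...),
# and the address differs per array even though the contents match.
print "$p\n$q\n";

# Comparing the references themselves: both survive.
my @by_ref = uniq($p, $q);

# Comparing a string built from the contents: the duplicate is dropped.
my @by_content = uniq(map { join ',', @$_ } $p, $q);

printf "by reference: %d, by content: %d\n", scalar @by_ref, scalar @by_content;
```

So any working approach has to derive a key from the row contents, whether by joining the fields or by giving the rows a string overload.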
Re: What is the fastest way to delete duplicates from multi dimensional array ? by kcott (Archbishop) on May 30, 2012 at 00:19 UTC
To achieve this, you can change

my @unique = uniq @$points;

to

my %seen;
my @unique = grep { ! $seen{join(q{,}, @$_)}++ } @$points;

Tested on the command line using your posted data:

$ perl -Mstrict -Mwarnings -e '
    my ($points, %seen, @unique);

    # read your posted data
    while (<>) { push @$points => [split]; }

    # remove duplicates
    @unique = grep { ! $seen{join(q{,}, @$_)}++ } @$points;

    # print result
    print qq{@{$_}[0..3]\n} for @unique;
'
 1338020418 33.514447422 9.142337666 16.479736
 1338020431 33.514425964 9.142852650 16.960449
 1338020431 33.514425964 9.142852650 16.960449
 1338020446 33.514318676 9.143496380 16.960449
 1338020446 33.514318676 9.143496380 16.960449
 1338020446 33.514318676 9.143496380 16.960449
 1338020459 33.514211388 9.144140110 16.479736
 1338020479 33.514125557 9.145019875 14.557007
 1338020479 33.514125557 9.145019875 14.557007
 1338020484 33.514104099 9.145234451 14.557007
 1338020484 33.514104099 9.145234451 14.557007
1338020418 33.514447422 9.142337666 16.479736
1338020431 33.514425964 9.142852650 16.960449
1338020446 33.514318676 9.143496380 16.960449
1338020459 33.514211388 9.144140110 16.479736
1338020479 33.514125557 9.145019875 14.557007
1338020484 33.514104099 9.145234451 14.557007

As a side issue, note the use of the array slice on the last line. Your

foreach (@$points) {
  print " $_->[3] $_->[0] $_->[1] $_->[2] \n";
}

could have been written as:

print " @{$_}[3,0,1,2] \n" for @$points;

-- Ken