in reply to How to eliminate redundancy in huge dataset (1,000 - 10,000)

Nowadays, a dataset with 1,000 - 10,000 elements is not huge any more!

The common way to eliminate duplicates in Perl is to use a hash:

my %seen;
while (<>) {
    my @parts = split /\|/;           # fields are pipe-delimited
    next if $seen{ $parts[1] }++;     # second field already seen: skip this line
    print;                            # first occurrence: pass it through
}

If the data really were huge, then for your particular case, where the key is an integer, you could use a bit vector to record the seen entries and reduce memory consumption.
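Here is a minimal sketch of that idea, assuming the key in the second pipe-delimited field is a non-negative integer small enough to serve as a bit offset (the field index is carried over from the hash example above):

use strict;
use warnings;

my $seen = '';                        # packed bit string; one bit per possible key value
while (<>) {
    my $key = ( split /\|/ )[1];      # assumed to be a non-negative integer
    next if vec( $seen, $key, 1 );    # bit already set: duplicate, skip it
    vec( $seen, $key, 1 ) = 1;        # mark this key as seen
    print;
}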

And for really huge data sets, an external sort followed by a postprocessing pass would be the better approach.
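A rough sketch of that approach, assuming the input has already been sorted on the key field externally (for example with sort -t'|' -k2,2 data.txt; the script name below is just a placeholder): the postprocessing pass then only has to compare each key with the one from the previous line.

use strict;
use warnings;

# Expects the lines on STDIN to be pre-sorted on the second pipe-delimited
# field, e.g.:  sort -t'|' -k2,2 data.txt | perl dedupe_sorted.pl
my $prev;
while (<>) {
    my $key = ( split /\|/ )[1];
    next if defined $prev && $key eq $prev;   # same key as the previous line: duplicate
    $prev = $key;                             # remember the key of the line we kept
    print;
}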
