in reply to How to eliminate redundancy in huge dataset (1,000 - 10,000)

Nowadays, a dataset with 1,000 - 10,000 elements is not huge any more!

The common way to eliminate duplicates in Perl is to use a hash:

my %seen;
while (<>) {
    my @parts = split /\|/;           # fields are pipe-delimited
    next if $seen{ $parts[1] }++;     # second field already seen: skip this line
    print;                            # first occurrence: pass it through
}

If the data really were huge, then for your particular case, where the key is an integer, you could use a bit vector to record the seen entries and reduce memory consumption.
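Here is a minimal sketch of that idea, assuming the key in the second pipe-delimited field is a non-negative integer small enough to serve as a bit offset (the field index is carried over from the hash example above):

use strict;
use warnings;

my $seen = '';                        # packed bit string; one bit per possible key value
while (<>) {
    my $key = ( split /\|/ )[1];      # assumed to be a non-negative integer
    next if vec( $seen, $key, 1 );    # bit already set: duplicate, skip it
    vec( $seen, $key, 1 ) = 1;        # mark this key as seen
    print;
}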

And for really huge data sets, an external sort followed by a postprocessing pass would be the better approach.
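A rough sketch of that approach, assuming the input has already been sorted on the key field externally (for example with sort -t'|' -k2,2 data.txt; the script name below is just a placeholder): the postprocessing pass then only has to compare each key with the one from the previous line.

use strict;
use warnings;

# Expects the lines on STDIN to be pre-sorted on the second pipe-delimited
# field, e.g.:  sort -t'|' -k2,2 data.txt | perl dedupe_sorted.pl
my $prev;
while (<>) {
    my $key = ( split /\|/ )[1];
    next if defined $prev && $key eq $prev;   # same key as the previous line: duplicate
    $prev = $key;                             # remember the key of the line we kept
    print;
}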
