in reply to How to eliminate redundancy in huge dataset (1,000 - 10,000)
The common way to eliminate duplicates in Perl is to use a hash:
    my %seen;
    while (<>) {
        my @parts = split /\|/;            # split on the pipe delimiter
        next if $seen{ $parts[1] }++;      # key already seen: skip this line
        print;                             # first occurrence: keep it
    }
If the data set were huge, and since in your particular case the key is an integer, you could use a bit vector to record the seen keys and reduce memory consumption.
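A minimal sketch of that idea, assuming the key in the second field is a non-negative integer with a known upper bound (the 10,000 in the title suggests it is; adjust to your real range):

    my $seen = '';                         # bit vector: one bit per possible key
    while (<>) {
        my $key = (split /\|/)[1];
        next if vec($seen, $key, 1);       # bit already set: duplicate, skip
        vec($seen, $key, 1) = 1;           # mark this key as seen
        print;
    }

With one bit per key this stays small even for millions of keys, whereas the hash stores every key as a full scalar plus bucket overhead.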
And for really huge data sets, an external sort followed by a postprocessing pass would be the better approach.
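A minimal sketch of the postprocessing pass, assuming the input has already been sorted on the key field (for example with something like sort -t'|' -k2,2 data.txt), so that duplicate keys end up on adjacent lines:

    my $prev;
    while (<>) {
        my $key = (split /\|/)[1];
        next if defined $prev && $key eq $prev;   # same key as the previous line: skip
        $prev = $key;                             # remember the last key seen
        print;
    }

Only one key is held in memory at a time; the heavy lifting is done by the external sort, which can spill to disk as needed.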