shaezi has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I am attempting to remove duplicates from sorted data by keeping the last entry in a set of matching keys. For example, consider the following data:

02626216000.00001.00001.00001.00001.00
02626216000.00002.00002.00002.00002.00
02626216000.00005.00000.00005.00005.00

The key is the first 8 bytes of data, followed by data buckets.
If I use the following code:

foreach (@sorted_data) {
    push(@data_out, $_) unless ($seen{substr($_,0,7)}++);
}

I eliminate the duplicate records but keep the first occurrence. Is there a way to use the same hash but keep the last record of a set of recurring keys? Basically, the records are sorted in ascending order, and I want to keep the record (in a set of duplicates) that has the highest data values.
i.e. keep:
02626216000.00005.00000.00005.00005.00
instead of:
02626216000.00001.00001.00001.00001.00

Thanks!

Replies are listed 'Best First'.
Re: removal of dupes using a hash
by Limbic~Region (Chancellor) on May 20, 2004 at 20:50 UTC
    shaezi,
    Instead of pushing to an array when a key isn't already present in a hash, just assign into the hash: the key stays the same, but the value keeps being overwritten. You can use an array to preserve insertion order, since that seems to be important to you.
    for ( @data ) {
        my $key = substr($_, 0, 7);
        push @order, $key if ! defined $unique{$key};
        $unique{$key} = $_;
    }
    print "$_ : $unique{$_}\n" for @order;
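    As a quick check (a sketch, not part of the original reply, using the three sample records from the question and assuming they arrive already sorted), the hash ends up holding only the last record for each key, and @order keeps the keys in first-seen order:
    use strict;
    use warnings;

    my @data = (
        "02626216000.00001.00001.00001.00001.00",
        "02626216000.00002.00002.00002.00002.00",
        "02626216000.00005.00000.00005.00005.00",
    );
    my ( %unique, @order );
    for ( @data ) {
        my $key = substr($_, 0, 7);
        push @order, $key if ! defined $unique{$key};
        $unique{$key} = $_;
    }
    # prints: 0262621 : 02626216000.00005.00000.00005.00005.00
    print "$_ : $unique{$_}\n" for @order;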
    Cheers - L~R
Re: removal of dupes using a hash
by pizza_milkshake (Monk) on May 20, 2004 at 21:09 UTC
    Assuming everything's already in the right order, just keep assigning the latest value to the key, and at the end you'll have all your values in a hash. This is the quick and dirty way, which assumes you don't have hundreds of thousands to millions of keys; if so, write it a little more C-like.
    #!perl -wl
    use strict;
    my %h;
    $h{substr($_, 0, 1)} = substr($_, 2, 1) while <DATA>;
    print "$_.$h{$_}" for sort keys %h;
    __DATA__
    1.1
    1.2
    1.3
    2.0
    2.1
    2.2
    3.4
    4.5
    4.6
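    For the "more C-like" case, a sketch (not from the reply above) that leans on the input already being sorted: it holds only one record at a time and emits a record whenever the key changes, so memory use stays flat however many keys there are. It keys on the first 8 bytes, as the question describes (the question's own code uses 7):
    #!/usr/bin/perl
    use strict;
    use warnings;

    # reads records, already sorted ascending by key, from STDIN
    my ( $prev_key, $prev_rec );
    while ( my $rec = <STDIN> ) {
        chomp $rec;
        my $key = substr($rec, 0, 8);
        # the key changed, so the previous record was the last (highest) of its run
        print "$prev_rec\n" if defined $prev_key && $key ne $prev_key;
        ( $prev_key, $prev_rec ) = ( $key, $rec );
    }
    print "$prev_rec\n" if defined $prev_rec;   # flush the final run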

    perl -e"\$_=qq/nwdd\x7F^n\x7Flm{{llql0}qs\x14/;s/./chr(ord$&^30)/ge;print"

Re: removal of dupes using a hash
by sleepingsquirrel (Chaplain) on May 20, 2004 at 22:45 UTC
    This isn't the model of efficiency but you could try...
    @data_out = uniq(@sorted_data);

    sub uniq {
        my @xs = @_;
        my $x  = shift @xs;
        return $x unless @xs;
        return ( uniq(@xs) ) if grep { substr($x,0,7) eq substr($_,0,7) } @xs;
        return ( $x, uniq(@xs) );
    }
    ...or a semi-brute force method like...
    foreach (reverse @sorted_data) {
        unshift(@data_out, $_) unless ($seen{substr($_,0,7)}++);
    }
    Update: Fixed a bug in uniq (it added a spurious undef to the end of the array). Here's another subroutine I'm a little more fond of...
    @data_out = nub(@sorted_data);

    sub nub {
        my @xs = @_;
        my $x  = pop @xs;
        return $x unless @xs;
        return ( nub( grep { substr($x,0,7) ne substr($_,0,7) } @xs ), $x );
    }
      ...better yet...
      while ($_ = pop @sorted_data) {
          unshift(@data_out, $_) unless ($seen{substr($_,0,7)}++);
      }
Re: removal of dupes using a hash
by johndageek (Hermit) on May 20, 2004 at 21:12 UTC
    foreach (@sorted_data) {
        /\./;
        $hash{$`} = $';
    }
    # print hash here by key value - only latest entry exists
    Enjoy!
    Dageek
      The specifics in this reply are slightly less than ideal. When Perl sees you using $` and $' even once it assumes you might want to use these variables for each and every regular expression thereafter.

      This is considered a bad thing because Perl will then copy the prematched text and the postmatched text into $` and $' every time it sees a regular expression. This is okay in this kind of example, because the data we're dealing with appears to be small. However, it rapidly becomes inefficient once we start dealing with longer strings.

      A similar, slightly more efficient version can be written:

      foreach (@sorted_data) {
          /(.*?)\.(.*)/;
          $hash{$1} = $2;
      }
      # print hash here by key value - only latest entry exists

      It may also be worth wondering why a regular expression is needed at all:

      foreach (@sorted_data) {
          my ($key, $value) = split /\./, $_, 2;
          $hash{$key} = $value;
      }
      # print hash here by key value - only latest entry exists

      Hope this helps,

      jarich

Re: removal of dupes using a hash
by deibyz (Hermit) on May 21, 2004 at 13:37 UTC
    I don't have much experience with Perl, but I was playing with something like this a few days ago. I read the Perl Idioms Explained section and came up with something like this (for this example):
    my %nodups = %{ { map { split /\./, $_, 2 } @values } };
    I've tested it and it seems to work, but I'm not sure about a couple of things:
    Does map keep the order (it seems that it does)? Is this approach going to take too much memory with a large input? (See the sketch below.)

    Thanks,
    deibyz
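    On the ordering question: map returns its results in list order, and when a list with repeated keys is assigned into a hash, later pairs overwrite earlier ones, so the last record for each key wins, which is the behaviour wanted here. On memory: the map builds the full list of key/value pairs first, and the outer %{{ ... }} then builds an anonymous hash and copies it into %nodups, so assigning the map's list straight into the hash saves one copy. A small sketch (hypothetical variable names, using the question's sample records):
    use strict;
    use warnings;

    my @values = (
        "02626216000.00001.00001.00001.00001.00",
        "02626216000.00002.00002.00002.00002.00",
        "02626216000.00005.00000.00005.00005.00",
    );

    # each record splits at the first "." into (key, rest); later pairs overwrite earlier ones
    my %nodups = map { split /\./, $_, 2 } @values;

    # prints: 02626216000.00005.00000.00005.00005.00
    print "$_.$nodups{$_}\n" for sort keys %nodups;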
Re: removal of dupes using a hash
by sleepingsquirrel (Chaplain) on May 21, 2004 at 16:48 UTC
    Here's a snippet which creates an array of references to the desired values...
    $l = 0;
    while ($l < $#sorted) {
        # $l++ +1 indexes the next element while advancing $l for the next pass
        push @data_out_ref, \$sorted[$l]
            if substr($sorted[$l],0,7) ne substr($sorted[$l++ +1],0,7);
    }
    # the final record is always the last of its run, so keep a reference to it too
    push @data_out_ref, \$sorted[-1] if @sorted;
    print "$$_\n" for @data_out_ref;
    ...and here's one which removes the elements from the original array...
    for ($l = 0; $l < $#sorted; $l++) {
        if (substr($sorted[$l],0,7) eq substr($sorted[$l+1],0,7)) {
            splice(@sorted, $l, 1);
            $l--;   # re-check this index, since the next element shifted down into it
        }
    }