shaezi has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I am attempting to remove duplicates from sorted data by keeping the last entry in a set of matching keys. For example, consider the following data:

02626216000.00001.00001.00001.00001.00
02626216000.00002.00002.00002.00002.00
02626216000.00005.00000.00005.00005.00

The key is the first 8 bytes of data, followed by data buckets.
If I use the following code:

foreach (@sorted_data) {
    push(@data_out, $_) unless ($seen{substr($_,0,7)}++);
}

I eliminate the duplicate records but keep the first occurrence. Is there a way to use the same hash but keep the last record of a set of recurring keys? Basically, the records are sorted in ascending order, and I want to keep the record (in a set of duplicates) that has the highest data values.
i.e. keep:
02626216000.00005.00000.00005.00005.00
instead of:
02626216000.00001.00001.00001.00001.00

Thanks!

Replies are listed 'Best First'.
Re: removal of dupes using a hash
by Limbic~Region (Chancellor) on May 20, 2004 at 20:50 UTC
    shaezi,
    Instead of pushing to an array when a key isn't already present in a hash, just assign into the hash: the key stays the same, but the value keeps being overwritten. You can use an array to preserve insertion order, since that seems to be important to you.
    for ( @data ) {
        my $key = substr($_, 0, 7);
        push @order, $key if ! defined $unique{$key};
        $unique{$key} = $_;
    }
    print "$_ : $unique{$_}\n" for @order;
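    As a quick check (a sketch, not part of the original reply, using the three sample records from the question and assuming they arrive already sorted), the hash ends up holding only the last record for each key, and @order keeps the keys in first-seen order:
    use strict;
    use warnings;

    my @data = (
        "02626216000.00001.00001.00001.00001.00",
        "02626216000.00002.00002.00002.00002.00",
        "02626216000.00005.00000.00005.00005.00",
    );
    my ( %unique, @order );
    for ( @data ) {
        my $key = substr($_, 0, 7);
        push @order, $key if ! defined $unique{$key};
        $unique{$key} = $_;
    }
    # prints: 0262621 : 02626216000.00005.00000.00005.00005.00
    print "$_ : $unique{$_}\n" for @order;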
    Cheers - L~R
Re: removal of dupes using a hash
by pizza_milkshake (Monk) on May 20, 2004 at 21:09 UTC
    Assuming everything's already in the right order, just keep assigning the latest value to the key, and at the end you'll have all your values in a hash. This is the quick and dirty way, which assumes you don't have hundreds of thousands to millions of keys; if so, write it a little more C-like.
    #!perl -wl
    use strict;
    my %h;
    $h{substr($_, 0, 1)} = substr($_, 2, 1) while <DATA>;
    print "$_.$h{$_}" for sort keys %h;
    __DATA__
    1.1
    1.2
    1.3
    2.0
    2.1
    2.2
    3.4
    4.5
    4.6
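    For the "more C-like" case, a sketch (not from the reply above) that leans on the input already being sorted: it holds only one record at a time and emits a record whenever the key changes, so memory use stays flat however many keys there are. It keys on the first 8 bytes, as the question describes (the question's own code uses 7):
    #!/usr/bin/perl
    use strict;
    use warnings;

    # reads records, already sorted ascending by key, from STDIN
    my ( $prev_key, $prev_rec );
    while ( my $rec = <STDIN> ) {
        chomp $rec;
        my $key = substr($rec, 0, 8);
        # the key changed, so the previous record was the last (highest) of its run
        print "$prev_rec\n" if defined $prev_key && $key ne $prev_key;
        ( $prev_key, $prev_rec ) = ( $key, $rec );
    }
    print "$prev_rec\n" if defined $prev_rec;   # flush the final run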

    perl -e"\$_=qq/nwdd\x7F^n\x7Flm{{llql0}qs\x14/;s/./chr(ord$&^30)/ge;print"

Re: removal of dupes using a hash
by sleepingsquirrel (Chaplain) on May 20, 2004 at 22:45 UTC
    This isn't the model of efficiency but you could try...
    @data_out = uniq(@sorted_data);

    sub uniq {
        my @xs = @_;
        my $x  = shift @xs;
        return $x unless @xs;
        return ( uniq(@xs) ) if grep { substr($x,0,7) eq substr($_,0,7) } @xs;
        return ( $x, uniq(@xs) );
    }
    ...or a semi-brute force method like...
    foreach (reverse @sorted_data) {
        unshift(@data_out, $_) unless ($seen{substr($_,0,7)}++);
    }
    Update: Fixed a bug in uniq (it added a spurious undef to the end of the array). Here's another subroutine I'm a little more fond of...
    @data_out = nub(@sorted_data);

    sub nub {
        my @xs = @_;
        my $x  = pop @xs;
        return $x unless @xs;
        return ( nub( grep { substr($x,0,7) ne substr($_,0,7) } @xs ), $x );
    }
      ...better yet...
      while ($_ = pop @sorted_data) {
          unshift(@data_out, $_) unless ($seen{substr($_,0,7)}++);
      }
Re: removal of dupes using a hash
by johndageek (Hermit) on May 20, 2004 at 21:12 UTC
    foreach (@sorted_data) {
        /\./;
        $hash{$`} = $';
    }
    # print hash here by key value - only latest entry exists
    Enjoy!
    Dageek
      The specifics in this reply are slightly less than ideal. When Perl sees you using $` and $' even once it assumes you might want to use these variables for each and every regular expression thereafter.

      This is considered a bad thing because Perl will then copy the prematched text and the postmatched text into $` and $' every time it sees a regular expression. This is okay in this kind of example, because the data we're dealing with appears to be small. However, it rapidly becomes inefficient once we start dealing with longer strings.

      A similar, slightly more efficient version can be written:

      foreach (@sorted_data) {
          /(.*?)\.(.*)/;
          $hash{$1} = $2;
      }
      # print hash here by key value - only latest entry exists

      It may also be worth wondering why a regular expression is needed at all:

      foreach (@sorted_data) {
          my ($key, $value) = split /\./, $_, 2;
          $hash{$key} = $value;
      }
      # print hash here by key value - only latest entry exists

      Hope this helps,

      jarich

Re: removal of dupes using a hash
by deibyz (Hermit) on May 21, 2004 at 13:37 UTC
    I don't have much experience with Perl, but I was playing with something like this a few days ago. I read the Perl Idioms Explained section and came up with something like this (for this example):
    my %nodups = %{ { map { split /\./, $_, 2 } @values } };
    I've tested it and it seems to work, but I'm not sure about a couple of things:
    Does map keep the order (it seems that it does)? Is this approach going to take too much memory with a large input? (See the sketch below.)

    Thanks,
    deibyz
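    On the ordering question: map returns its results in list order, and when a list with repeated keys is assigned into a hash, later pairs overwrite earlier ones, so the last record for each key wins, which is the behaviour wanted here. On memory: the map builds the full list of key/value pairs first, and the outer %{{ ... }} then builds an anonymous hash and copies it into %nodups, so assigning the map's list straight into the hash saves one copy. A small sketch (hypothetical variable names, using the question's sample records):
    use strict;
    use warnings;

    my @values = (
        "02626216000.00001.00001.00001.00001.00",
        "02626216000.00002.00002.00002.00002.00",
        "02626216000.00005.00000.00005.00005.00",
    );

    # each record splits at the first "." into (key, rest); later pairs overwrite earlier ones
    my %nodups = map { split /\./, $_, 2 } @values;

    # prints: 02626216000.00005.00000.00005.00005.00
    print "$_.$nodups{$_}\n" for sort keys %nodups;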
Re: removal of dupes using a hash
by sleepingsquirrel (Chaplain) on May 21, 2004 at 16:48 UTC
    Here's a snippet which creates an array of references to the desired values...
    $l = 0;
    while ($l < $#sorted) {
        # $l++ +1 indexes the next element while advancing $l for the next pass
        push @data_out_ref, \$sorted[$l]
            if substr($sorted[$l],0,7) ne substr($sorted[$l++ +1],0,7);
    }
    # the final record is always the last of its run, so keep a reference to it too
    push @data_out_ref, \$sorted[-1] if @sorted;
    print "$$_\n" for @data_out_ref;
    ...and here's one which removes the elements from the original array...
    for ($l = 0; $l < $#sorted; $l++) {
        if (substr($sorted[$l],0,7) eq substr($sorted[$l+1],0,7)) {
            splice(@sorted, $l, 1);
            $l--;   # re-check this index, since the next element shifted down into it
        }
    }