arunhorne has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I have the following sub I wrote to remove duplicates from an array. I want to modify it so I can pass the array by reference and modify the contents of the referenced array. The benefits should be two-fold: saving stack memory (handy for larger arrays), and not needing to return a value, as the passed array itself is modified.

Can anyone help me? I don't know how to do this. My original code is below:

sub util_remove_duplicates(@) {
    my %hash;
    undef %hash;
    @hash{@_} = ();
    return keys %hash;
}
____________
Arun

Replies are listed 'Best First'.
Re: Removing Duplicates from Array Passed by Reference
by broquaint (Abbot) on May 16, 2003 at 13:57 UTC
Re: Removing Duplicates from Array Passed by Reference
by Limbic~Region (Chancellor) on May 16, 2003 at 13:59 UTC
    arunhorne,
    You have a prototype, which makes this a little different from a regular subroutine. I will leave it to you to read about prototypes and how to get the sub to work with either arrays or references to arrays.
    sub util_remove_duplicates {
        my $array_ref = shift;
        my %hash;
        @hash{@{$array_ref}} = ();
        return keys %hash;
    }
    There are a couple of things to point out.
  • Check out wantarray to see if the sub is being called in list or scalar context - that way you can return either the list or a reference
  • You do not need to undef the hash after you create it with my
  • You could reduce it even further, but it becomes less readable
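    The wantarray suggestion above might look something like this - a sketch of my own, not L~R's code:

```perl
# Sketch: return a list in list context, an array reference in scalar
# context, using wantarray to detect how the sub was called.
sub util_remove_duplicates {
    my $array_ref = shift;
    my %hash;
    @hash{@{$array_ref}} = ();        # hash slice: keys are the unique values
    my @unique = keys %hash;
    return wantarray ? @unique : \@unique;
}

my @a    = qw(a b a c b);
my @list = util_remove_duplicates(\@a);   # list context: gets the elements
my $ref  = util_remove_duplicates(\@a);   # scalar context: gets a reference
```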

    Cheers - L~R

Re: Removing Duplicates from Array Passed by Reference
by Skeeve (Parson) on May 16, 2003 at 14:20 UTC
    Did I get you right? The code should edit the array "in place" and not return a new one? This works but loses any ordering you had:
    sub util_remove_duplicates {
        my %hash;
        @hash{@{$_[0]}} = ();
        @{$_[0]} = keys %hash;
    }
Re: Removing Duplicates from Array Passed by Reference
by Rhandom (Curate) on May 16, 2003 at 14:47 UTC
    A little more optimized than the first few - and more true to the request - plus it keeps the elements in order - plus features.

    sub util_remove_duplicates {
        my $ref = ref($_[0]) ? shift() : [@_];
        my %hash;
        for (my $i = 0; $i <= $#$ref; $i++) {
            if (! $hash{$ref->[$i]}) {
                $hash{$ref->[$i]} = 1;
                next;
            }
            splice @$ref, $i, 1;
            $i--;
        }
        return wantarray ? @$ref : $ref;
    }

    my @a = qw(a a b c d a d c a e);
    util_remove_duplicates(\@a);          # works
    my @B = util_remove_duplicates(\@a);  # works and returns array to @B
    my $B = util_remove_duplicates(\@a);  # works and returns array ref to $B
    my $B = util_remove_duplicates(@a);   # works without modifying @a


    my @a=qw(random brilliant braindead); print $a[rand(@a)];
Re: Removing Duplicates from Array Passed by Reference
by BrowserUk (Patriarch) on May 17, 2003 at 00:46 UTC

    From what I could see, none of the answers above actually address the problem of creating huge temporary lists when de-duping very large arrays.

    #! perl -slw
    use strict;

    sub dedup {
        my $ref = shift;
        my( $i, %h ) = 0;
        while( $i < @$ref ) {
            my $v = $ref->[$i];
            unless( exists $h{$v} ) {
                $h{$v} = undef;
                $i++;
                next;
            }
            splice @{$ref}, $i, 1;
        }
    }

    my @a = (7, 20..30, 1..1000, 200..300, 7);
    print scalar @a;
    dedup \@a;
    print scalar @a;
    #print "@a";

    This avoids the temporary lists by scanning the array, noting what it has seen, and splicing out any duplicates as it goes.

    If your data tends to have long runs of duplicates, as shown above, you can add logic to postpone the splice until a run of dups ends and then splice out the whole run in a single hit, which is somewhat more efficient at the cost of complexity.

    #! perl -slw
    use strict;

    sub dedup {
        my $ref = shift;
        my( $i, $count, %h ) = (0, 0);
        while( $i < @$ref ) {
            my $v = $ref->[$i];
            unless( exists $h{$v} ) {
                $h{$v} = undef;
                $i++;
                next unless $count;
                $i -= $count;
                splice @{$ref}, $i-1, $count;
                $count = 0;
                next;
            }
            $count++;
            $i++;
        }
        $i -= $count;
        splice @{$ref}, $i, $count;
    }

    my @a = (5, 1..2, 1..10, 2..5, 7);
    print scalar @a, ":@a";
    dedup \@a;
    print scalar @a, ":@a";

    I can't help but think that some of the extra complexity and duplication could be factored out, but I haven't worked out how. (Yet:).


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

      For this particular problem, I believe that we can manually do the work of splice more efficiently than splice could itself.

      # Untested :(
      sub dedup {
          my $ref = shift;
          my %seen;
          my $c = 0; # current index
          my $n = 0; # next available slot
          foreach (@$ref) {
              if (not $seen{$_}++) {
                  $ref->[$n] = $_ unless $n == $c;
                  $n++;
              }
              $c++;
          }
          $#{$ref} = $n-1;
      }

      Every element gets moved once, at most. Elements before the first duplicate don't get moved at all.
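      A self-contained sketch of this compact-in-place idea, slightly simplified from the parent's version (it drops the separate $c counter and always assigns, even for the harmless self-assignment case):

```perl
# Compact-in-place dedup: each unique element is copied at most once
# into its final slot, then the tail is truncated. First-seen order
# is preserved.
sub dedup {
    my $ref = shift;
    my %seen;
    my $n = 0;                         # next slot to fill
    foreach my $v (@$ref) {
        $ref->[$n++] = $v unless $seen{$v}++;
    }
    $#{$ref} = $n - 1;                 # truncate the duplicate tail
}

my @a = (7, 20..30, 1..1000, 200..300, 7);
dedup \@a;
# @a now holds the 1000 distinct values, in first-seen order
```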

        That's cool.

        From what I understand, which may well be wrong, splice doesn't actually move anything. It just unlinks the deleted item or range from the chain and frees the unlinked space? (Must re-read PerlGuts Illustrated again sometime:).

        I'm wary of using foreach (@$ref)... as I thought that it would expand the array to a list of aliases. I know that for( 0..n ) {...} uses an iterator rather than building a list, but I'm not sure the same holds when iterating an array. If it does, that would be good to know.

        It would be interesting to compare the performance (memory and speed). Maybe I'll have a go at it later.
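        On the aliasing question above, one thing that is easy to demonstrate (my own sketch) is that foreach over an array does alias $_ to the array's elements rather than copying them; whether it builds the whole list of aliases up front is a separate memory question that this doesn't settle:

```perl
# foreach aliases $_ to each element of the array,
# so modifying $_ modifies the array in place.
my @a = (1, 2, 3);
for (@a) { $_ *= 10 }
# @a is now (10, 20, 30)
```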

        Update Perl wins again:)

        D:\Perl\test>258641
                       Rate util___big splice_big util___sml splice_sml
        util___big   14.0/s         --       -45%       -98%       -99%
        splice_big   25.4/s        81%         --       -97%       -98%
        util___sml    876/s      6151%      3348%         --       -41%
        splice_sml   1490/s     10529%      5764%        70%         --

        I'll post the benchmark if anyone is interested.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
      We can do much better by avoiding splice!
      #! perl -slw
      use strict;

      sub dedup {
          my $ref = shift;
          my( $r, $l, %h );
          $r = $l = 0;
          while( $r < @$ref ) {
              if( $h{$ref->[$r]}++ ) {    # already seen
                  ++$r;
              }
              else {                      # not yet seen. Copy!
                  $ref->[$l++] = $ref->[$r++];
              }
          }
          $#$ref = $l-1;  # update: originally $#$ref = $l, which left one stale element
      }

      my @a = (7, 20..30, 1..1000, 200..300, 7);
      print scalar @a;
      dedup \@a;
      print scalar @a;
      #print "@a";

      This works by moving two indices (r-ight and l-eft) over the array, copying the right element to the left if it hasn't been seen yet. At the end the length of the array is adjusted.

        Um:), by what criteria ... much better?

        If you run the code from your post, but uncomment the last line, you'll see that your algorithm leaves a single duplicate at the end of the array.

        With the sample data in your post, this turns out to be 989, which is strange, as 989 isn't a duplicate in the input array. That is probably easily fixed though. I had a similar problem with the first cut of my splice version.

        However, your algorithm is essentially the same as utils above at Re: Re: Removing Duplicates from Array Passed by Reference, and as I showed in my reply at Re: Re: Re: Removing Duplicates from Array Passed by Reference, splice wins against that algorithm because splice doesn't have to copy anything: it simply unlinks the discarded elements from the linked list that implements Perl's arrays. I added your version to my benchmark and, ignoring the error, splice beats it easily.

        Here are the results

                       Rate util___sml skeeve_sml splice_sml
        util___sml   1368/s         --        -1%       -11%
        skeeve_sml   1387/s         1%         --       -10%
        splice_sml   1536/s        12%        11%         --

                       Rate util___big skeeve_big splice_big
        util___big   17.1/s         --       -12%       -24%
        skeeve_big   19.4/s        13%         --       -13%
        splice_big   22.4/s        31%        15%         --

        As you can see, your implementation beat utils, but maybe once you've fixed the error that may no longer be true. Yell if you want the Benchmark code.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
Re: Removing Duplicates from Array Passed by Reference
by fletcher_the_dog (Friar) on May 16, 2003 at 16:00 UTC
    If you want to keep the order and edit in place, try this:
    sub util_remove_duplicates {
        my $ref = shift;
        my %hash;
        @$ref = grep { !$hash{$_}++ } @$ref;
    }

    my @array = qw(a b c a j k l m o p q k);
    print "@array\n";
    util_remove_duplicates(\@array);
    print "@array\n";
    OUTPUT:
    a b c a j k l m o p q k
    a b c j k l m o p q