arunhorne has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I have the following sub I wrote to remove duplicates from an array. I want to modify it so I can pass the array by reference and modify the contents of the referenced array. The benefits should be two-fold: saving stack memory (handy for larger arrays), and not needing to return a value, as the passed array itself is modified.

Can anyone help me? I don't know how to do this. My original code is below:

sub util_remove_duplicates(@) {
    my %hash;
    undef %hash;
    @hash{@_} = ();
    return keys %hash;
}
____________
Arun

Replies are listed 'Best First'.
Re: Removing Duplicates from Array Passed by Reference
by broquaint (Abbot) on May 16, 2003 at 13:57 UTC
Re: Removing Duplicates from Array Passed by Reference
by Limbic~Region (Chancellor) on May 16, 2003 at 13:59 UTC
    arunhorne,
    You have a prototype, which makes this a little different from a regular subroutine. I will leave it to you to read about prototypes and how to get the sub to work with either arrays or references to arrays.
    sub util_remove_duplicates {
        my $array_ref = shift;
        my %hash;
        @hash{@{$array_ref}} = ();
        return keys %hash;
    }
    There are a couple of things to point out.
  • Check out wantarray to see if the sub is being called in list or scalar context - that way you can return either the list or a reference
  • You do not need to undef the hash after you create it with my
  • You could reduce it even further, but it becomes less readable
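    The wantarray suggestion above might look something like this - a sketch of my own, not L~R's code:

```perl
# Sketch: return a list in list context, an array reference in scalar
# context, using wantarray to detect how the sub was called.
sub util_remove_duplicates {
    my $array_ref = shift;
    my %hash;
    @hash{@{$array_ref}} = ();        # hash slice: keys are the unique values
    my @unique = keys %hash;
    return wantarray ? @unique : \@unique;
}

my @a    = qw(a b a c b);
my @list = util_remove_duplicates(\@a);   # list context: gets the elements
my $ref  = util_remove_duplicates(\@a);   # scalar context: gets a reference
```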

    Cheers - L~R

Re: Removing Duplicates from Array Passed by Reference
by Skeeve (Parson) on May 16, 2003 at 14:20 UTC
    Did I get you right? The code should edit the array "in place" and not return a new one? This works but loses any ordering you had:
    sub util_remove_duplicates {
        my %hash;
        @hash{@{$_[0]}} = ();
        @{$_[0]} = keys %hash;
    }
Re: Removing Duplicates from Array Passed by Reference
by Rhandom (Curate) on May 16, 2003 at 14:47 UTC
    A little more optimized than the first few - and more true to the request - plus it keeps the elements in order - plus features.

    sub util_remove_duplicates {
        my $ref = ref($_[0]) ? shift() : [@_];
        my %hash;
        for (my $i = 0; $i <= $#$ref; $i++) {
            if (! $hash{$ref->[$i]}) {
                $hash{$ref->[$i]} = 1;
                next;
            }
            splice @$ref, $i, 1;
            $i--;
        }
        return wantarray ? @$ref : $ref;
    }

    my @a = qw(a a b c d a d c a e);
    util_remove_duplicates(\@a);          # works
    my @B = util_remove_duplicates(\@a);  # works and returns array to @B
    my $B = util_remove_duplicates(\@a);  # works and returns array ref to $B
    my $B = util_remove_duplicates(@a);   # works without modifying @a


    my @a=qw(random brilliant braindead); print $a[rand(@a)];
Re: Removing Duplicates from Array Passed by Reference
by BrowserUk (Patriarch) on May 17, 2003 at 00:46 UTC

    From what I could see, none of the answers above actually address the problem of creating huge temporary lists when de-duping very large arrays.

    #! perl -slw
    use strict;

    sub dedup {
        my $ref = shift;
        my( $i, %h ) = 0;
        while( $i < @$ref ) {
            my $v = $ref->[$i];
            unless( exists $h{$v} ) {
                $h{$v} = undef;
                $i++;
                next;
            }
            splice @{$ref}, $i, 1;
        }
    }

    my @a = (7, 20..30, 1..1000, 200..300, 7);
    print scalar @a;
    dedup \@a;
    print scalar @a;
    #print "@a";

    This avoids the temporary lists by scanning the array, noting what it has seen, and splicing out any duplicates as it goes.

    If your data tends to have long runs of duplicates, as shown above, you can add logic to postpone the splice until a run of dups ends and then splice out the whole run in a single hit, which is somewhat more efficient at the cost of complexity.

    #! perl -slw
    use strict;

    sub dedup {
        my $ref = shift;
        my( $i, $count, %h ) = (0, 0);
        while( $i < @$ref ) {
            my $v = $ref->[$i];
            unless( exists $h{$v} ) {
                $h{$v} = undef;
                $i++;
                next unless $count;
                $i -= $count;
                splice @{$ref}, $i-1, $count;
                $count = 0;
                next;
            }
            $count++;
            $i++;
        }
        $i -= $count;
        splice @{$ref}, $i, $count;
    }

    my @a = (5, 1..2, 1..10, 2..5, 7);
    print scalar @a, ":@a";
    dedup \@a;
    print scalar @a, ":@a";

    I can't help but think that some of the extra complexity and duplication could be factored out, but I haven't worked out how. (Yet:).


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

      For this particular problem, I believe that we can manually do the work of splice more efficiently than splice could itself.

      # Untested :(
      sub dedup {
          my $ref = shift;
          my %seen;
          my $c = 0; # current index
          my $n = 0; # next available slot
          foreach (@$ref) {
              if (not $seen{$_}++) {
                  $ref->[$n] = $_ unless $n == $c;
                  $n++;
              }
              $c++;
          }
          $#{$ref} = $n-1;
      }

      Every element gets moved once, at most. Elements before the first duplicate don't get moved at all.
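      A self-contained sketch of this compact-in-place idea, slightly simplified from the parent's version (it drops the separate $c counter and always assigns, even for the harmless self-assignment case):

```perl
# Compact-in-place dedup: each unique element is copied at most once
# into its final slot, then the tail is truncated. First-seen order
# is preserved.
sub dedup {
    my $ref = shift;
    my %seen;
    my $n = 0;                         # next slot to fill
    foreach my $v (@$ref) {
        $ref->[$n++] = $v unless $seen{$v}++;
    }
    $#{$ref} = $n - 1;                 # truncate the duplicate tail
}

my @a = (7, 20..30, 1..1000, 200..300, 7);
dedup \@a;
# @a now holds the 1000 distinct values, in first-seen order
```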

        That's cool.

        From what I understand, which may well be wrong, splice doesn't actually move anything. It just unlinks the deleted item or range from the chain and frees the unlinked space? (Must re-read PerlGuts Illustrated again sometime:).

        I'm wary of using foreach (@$ref)... as I thought that it would expand the array to a list of aliases. I know that for( 0..n ) {...} uses an iterator rather than building a list, but I'm not sure the same holds when iterating an array. If it does, that would be good to know.

        It would be interesting to compare the performance (memory and speed). Maybe I'll have a go at it later.
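        On the aliasing question above, one thing that is easy to demonstrate (my own sketch) is that foreach over an array does alias $_ to the array's elements rather than copying them; whether it builds the whole list of aliases up front is a separate memory question that this doesn't settle:

```perl
# foreach aliases $_ to each element of the array,
# so modifying $_ modifies the array in place.
my @a = (1, 2, 3);
for (@a) { $_ *= 10 }
# @a is now (10, 20, 30)
```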

        Update Perl wins again:)

        D:\Perl\test>258641
                       Rate util___big splice_big util___sml splice_sml
        util___big   14.0/s         --       -45%       -98%       -99%
        splice_big   25.4/s        81%         --       -97%       -98%
        util___sml    876/s      6151%      3348%         --       -41%
        splice_sml   1490/s     10529%      5764%        70%         --

        I'll post the benchmark if anyone is interested.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
      We can do much better by avoiding splice!
      #! perl -slw
      use strict;

      sub dedup {
          my $ref = shift;
          my( $r, $l, %h );
          $r = $l = 0;
          while( $r < @$ref ) {
              if( $h{$ref->[$r]}++ ) {    # already seen
                  ++$r;
              }
              else {                      # not yet seen. Copy!
                  $ref->[$l++] = $ref->[$r++];
              }
          }
          $#$ref = $l-1;  # update: originally $#$ref = $l, which left one stale element
      }

      my @a = (7, 20..30, 1..1000, 200..300, 7);
      print scalar @a;
      dedup \@a;
      print scalar @a;
      #print "@a";

      This works by moving two indices (r-ight and l-eft) over the array, copying the right element to the left if it hasn't been seen yet. At the end the length of the array is adjusted.

        Um:), by what criteria ... much better?

        If you run the code from your post, but uncomment the last line, you'll see that your algorithm leaves a single duplicate at the end of the array.

        With the sample data in your post, this turns out to be 989, which is strange, as 989 isn't a duplicate in the input array. That is probably easily fixed though. I had a similar problem with the first cut of my splice version.

        However, your algorithm is essentially the same as utils above at Re: Re: Removing Duplicates from Array Passed by Reference, and as I showed in my reply at Re: Re: Re: Removing Duplicates from Array Passed by Reference, splice wins against that algorithm because splice doesn't have to copy anything: it simply unlinks the discarded elements from the linked list that implements Perl's arrays. I added your version to my benchmark and, ignoring the error, splice beats it easily.

        Here are the results

                       Rate util___sml skeeve_sml splice_sml
        util___sml   1368/s         --        -1%       -11%
        skeeve_sml   1387/s         1%         --       -10%
        splice_sml   1536/s        12%        11%         --

                       Rate util___big skeeve_big splice_big
        util___big   17.1/s         --       -12%       -24%
        skeeve_big   19.4/s        13%         --       -13%
        splice_big   22.4/s        31%        15%         --

        As you can see, your implementation beat utils, but maybe once you've fixed the error that may no longer be true. Yell if you want the Benchmark code.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
Re: Removing Duplicates from Array Passed by Reference
by fletcher_the_dog (Friar) on May 16, 2003 at 16:00 UTC
    If you want to keep the order and edit in place, try this:
    sub util_remove_duplicates {
        my $ref = shift;
        my %hash;
        @$ref = grep { !$hash{$_}++ } @$ref;
    }

    my @array = qw(a b c a j k l m o p q k);
    print "@array\n";
    util_remove_duplicates(\@array);
    print "@array\n";
    OUTPUT:
    a b c a j k l m o p q k
    a b c j k l m o p q