in reply to Removing Duplicates from Array Passed by Reference

From what I could see, none of the answers above actually address the problem of creating huge temporary lists when de-duping very large arrays.

#! perl -slw
use strict;

sub dedup {
    my $ref = shift;
    my( $i, %h ) = 0;
    while( $i < @$ref ) {
        my $v = $ref->[ $i ];
        unless( exists $h{ $v } ) {
            $h{ $v } = undef;
            $i++;
            next;
        }
        splice @{ $ref }, $i, 1;
    }
}

my @a = ( 7, 20 .. 30, 1 .. 1000, 200 .. 300, 7 );
print scalar @a;
dedup \@a;
print scalar @a;
#print "@a";

This avoids the temporary lists by scanning the array, noting what it has seen, and splicing out any duplicates as it goes.

If your data tends to have long runs of duplicates, as shown above, you can add logic that defers the splice until a run of dups ends and then splices out the whole run in a single hit. That is somewhat more efficient, at the cost of extra complexity.

#! perl -slw
use strict;

sub dedup {
    my $ref = shift;
    my( $i, $count, %h ) = ( 0, 0 );
    while( $i < @$ref ) {
        my $v = $ref->[ $i ];
        unless( exists $h{ $v } ) {
            $h{ $v } = undef;
            $i++;
            next unless $count;
            $i -= $count;
            splice @{ $ref }, $i - 1, $count;
            $count = 0;
            next;
        }
        $count++;
        $i++;
    }
    $i -= $count;
    splice @{ $ref }, $i, $count;
}

my @a = ( 5, 1 .. 2, 1 .. 10, 2 .. 5, 7 );
print scalar @a, ":@a";
dedup \@a;
print scalar @a, ":@a";

I can't help but think that some of the extra complexity and duplication could be factored out, but I haven't worked out how. (Yet:).


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

Re: Re: Removing Duplicates from Array Passed by Reference
by Util (Priest) on May 17, 2003 at 03:24 UTC

    For this particular problem, I believe that we can manually do the work of splice more efficiently than splice could itself.

    # Untested :(
    sub dedup {
        my $ref = shift;
        my %seen;
        my $c = 0; # current
        my $n = 0; # next available slot
        foreach (@$ref) {
            if (not $seen{$_}++) {
                $ref->[$n] = $_ unless $n == $c;
                $n++;
            }
            $c++;
        }
        $#{$ref} = $n - 1;
    }

    Every element gets moved once, at most. Elements before the first duplicate don't get moved at all.

      That's cool.

      From what I understand, which may well be wrong, splice doesn't actually move anything. It just unlinks the deleted item or range from the chain and frees the unlinked space? (Must re-read PerlGuts Illustrated again sometime:).
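Whatever splice does internally, its in-place semantics are easy to observe. A minimal sketch (not from the thread, just an illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# splice removes a range from the array in place; the array itself
# shrinks, and the removed elements are only returned if requested.
my @a = ( 1, 2, 3, 4, 5 );
splice @a, 1, 2;      # remove 2 elements starting at index 1
print "@a\n";         # prints "1 4 5"
```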

      I'm wary of using foreach (@$ref)... as I thought it expanded the array into a list of aliases? I know that for( 0 .. n ) {...} uses an iterator rather than building a list, but I'm not sure whether the same is true when iterating an array directly. If it is, that would be good to know.
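On the aliasing question: as far as I know, foreach over a plain array walks it with an internal index and aliases $_ to each element in turn, rather than first flattening the array into a copied list of values. The aliasing itself is easy to see with a small sketch:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# $_ is an alias to each element, so modifying it modifies the
# array in place -- the loop is walking the array itself, not a copy.
my @nums = ( 1, 2, 3 );
$_ *= 10 for @nums;
print "@nums\n";      # prints "10 20 30"
```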

      It would be interesting to compare the performance (memory and speed). Maybe I'll have a go at it later.

      Update: Perl wins again:)

      D:\Perl\test>258641
                   Rate util___big splice_big util___sml splice_sml
      util___big 14.0/s         --       -45%       -98%       -99%
      splice_big 25.4/s        81%         --       -97%       -98%
      util___sml  876/s      6151%      3348%         --       -41%
      splice_sml 1490/s     10529%      5764%        70%         --

      I'll post the benchmark if anyone is interested.
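In the meantime, here is a sketch of how such a comparison could be set up with the core Benchmark module. The labels mirror the ones above, and the two subs are the implementations posted in this thread, but the harness itself is my guess at the shape, not the original benchmark:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# The splice-based dedup from the root node.
sub dedup_splice {
    my $ref = shift;
    my( $i, %h ) = 0;
    while( $i < @$ref ) {
        my $v = $ref->[ $i ];
        unless( exists $h{ $v } ) { $h{ $v } = undef; $i++; next }
        splice @$ref, $i, 1;
    }
}

# Util's copy-down dedup.
sub dedup_util {
    my $ref = shift;
    my %seen;
    my( $c, $n ) = ( 0, 0 );
    for ( @$ref ) {
        if ( not $seen{$_}++ ) {
            $ref->[ $n ] = $_ unless $n == $c;
            $n++;
        }
        $c++;
    }
    $#$ref = $n - 1;
}

my @big = ( 7, 20 .. 30, 1 .. 1000, 200 .. 300, 7 );

# Each iteration works on a fresh copy so the subs always see dups.
cmpthese( 50, {
    splice_big => sub { my @copy = @big; dedup_splice \@copy },
    util___big => sub { my @copy = @big; dedup_util   \@copy },
} );
```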


Re: Re: Removing Duplicates from Array Passed by Reference
by Skeeve (Parson) on May 19, 2003 at 08:32 UTC
    We can do much better by avoiding splice!
    #! perl -slw
    use strict;

    sub dedup {
        my $ref = shift;
        my( $r, $l, %h );
        $r = $l = 0;
        while( $r < @$ref ) {
            if( $h{ $ref->[ $r ] }++ ) {
                # already seen
                ++$r;
            }
            else {
                # not yet seen. Copy!
                $ref->[ $l++ ] = $ref->[ $r++ ];
            }
        }
        $#$ref = $l - 1;   # Update: was $#$ref = $l, one element too long
    }

    my @a = ( 7, 20 .. 30, 1 .. 1000, 200 .. 300, 7 );
    print scalar @a;
    dedup \@a;
    print scalar @a;
    #print "@a";
    This works by moving two indices (r-ight and l-eft) over the array, copying the right element to the left slot if it hasn't been seen yet. At the end, the length of the array is adjusted.

      Um:), by what criteria ... much better?

      If you run the code from your post, but uncomment the last line, you'll see that your algorithm leaves a single duplicate at the end of the array.

      With the sample data in your post, the leftover element turns out to be 989, which is strange, as 989 isn't duplicated in the input array. That is probably easily fixed though. I had a similar problem with the first cut of my splice version.

      However, your algorithm is essentially the same as util's above at Re: Re: Removing Duplicates from Array Passed by Reference. As I showed in my reply at Re: Re: Re: Removing Duplicates from Array Passed by Reference, splice wins against that algorithm because splice doesn't have to copy anything: it simply unlinks the discarded elements from the linked list that implements Perl's arrays. I added your version to my benchmark and, ignoring the error, splice beats it easily.

      Here are the results

                   Rate util___sml skeeve_sml splice_sml
      util___sml 1368/s         --        -1%       -11%
      skeeve_sml 1387/s         1%         --       -10%
      splice_sml 1536/s        12%        11%         --

                   Rate util___big skeeve_big splice_big
      util___big 17.1/s         --       -12%       -24%
      skeeve_big 19.4/s        13%         --       -13%
      splice_big 22.4/s        31%        15%         --

      As you can see, your implementation beat util's, though once you've fixed the error that may no longer be true. Yell if you want the Benchmark code.


        :-) Much better in the sense of not using splice. IIRC, I once read in the Camel book that one should avoid it.

        Nevertheless, you're right that mine is much like util's solution. I didn't notice that.

        And then I made the mistake of setting the array length one element too big...

        Bear with me... At least it's Monday ;-)