in reply to Removing certain elements from an array

my @array;
getreallyhugelist( \@array );
my @idxsToRemove = getnotashugelist();
my %toRemoveIdx;
@toRemoveIdx{@idxsToRemove} = (1) x @idxsToRemove;
@array = map { $toRemoveIdx{$_} ? () : $array[$_] } 0..$#array;

That last line copies the whole array once (puts it on the stack) and then only has to shift each non-removed element once. So it will probably be faster than repeated calls to splice() (each one having to shift the whole end of the array). I think this speed-up will more than compensate for the cost of building the hash (on the preceding line) if you are removing more than a few elements.
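As a self-contained illustration of the technique (getreallyhugelist() and getnotashugelist() above are placeholders from the post, so a literal array stands in for them here):

```perl
my @array        = ('a' .. 'f');   # indices 0..5
my @idxsToRemove = (1, 4);         # drop 'b' and 'e'

# Mark each doomed index in a hash for O(1) lookup...
my %toRemoveIdx;
@toRemoveIdx{@idxsToRemove} = (1) x @idxsToRemove;

# ...then rebuild the array in one pass, skipping marked indices.
@array = map { $toRemoveIdx{$_} ? () : $array[$_] } 0..$#array;
# @array is now ('a', 'c', 'd', 'f')
```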

        - tye (but my friends call me "Tye")

Replies are listed 'Best First'.
RE: Re: Removing certain elements from an array
by mikfire (Deacon) on Aug 24, 2000 at 22:03 UTC
    There is a memory tickling the back of my brain that disagrees with you. As I understand/remember it, perl implements arrays as linked lists. splice'ing nodes should not "shift the whole end of the array". This vague memory says perl's lists were explicitly designed to make this kind of stuff work well.

    My opinion of Benchmark is very low - but the results I generated seem to support this vague memory of mine.

    #!/os/dist/bin/perl -w
    use Benchmark;

    my @array = 0..10000;
    my @idxsToRemove = ();
    push @idxsToRemove, int( rand( 10000 ) ) for ( 0 .. 100 );
    my %toRemoveIdx;
    @toRemoveIdx{@idxsToRemove} = (1) x @idxsToRemove;

    sub Tye {
        my @subarray = @array;
        @subarray = map { $toRemoveIdx{$_} ? () : $subarray[$_] } 0..$#subarray;
    }

    sub btrott {
        my @subarray = @array;
        splice @subarray, $_, 1 for sort { $b <=> $a } @idxsToRemove;
    }

    timethese( shift @ARGV || -5, {
        'Tye'    => \&Tye,
        'btrott' => \&btrott,
    } );

    Generated these results:

    Benchmark: running Tye, btrott, mikfire, each for at least 5 CPU seconds...
           Tye:  7 wallclock secs ( 5.98 usr +  0.00 sys =  5.98 CPU) @  7.19/s (n=43)
        btrott:  6 wallclock secs ( 5.38 usr +  0.00 sys =  5.38 CPU) @ 63.01/s (n=339)

    mikfire

      Note that Perl arrays are most definitely not linked lists. They are neat, however.

      What you may be remembering is that pulling things off of the front of an array is nearly as efficient as popping them off of the back of an array. This is because Perl doesn't bother to shift all of the elements down until the gap in front gets big enough. But removing items from the middle of an array requires that one end (or the other?) be shifted.
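      This shift-versus-splice distinction shows up clearly in a rough timing sketch (my addition, not from the thread): draining an array from the front stays cheap, while draining it from the middle must move the tail on every removal.

      ```perl
      use Time::HiRes qw(time);

      # shift() just moves the array's start offset, so this is
      # roughly O(n) work in total.
      sub drain_front {
          my @a = (1) x 20_000;
          shift @a while @a;
      }

      # splice() in the middle must move the whole tail down each
      # time, so this is roughly O(n**2) work in total.
      sub drain_middle {
          my @a = (1) x 20_000;
          splice @a, int( @a / 2 ), 1 while @a;
      }

      my $t0 = time; drain_front();  my $front  = time - $t0;
      $t0    = time; drain_middle(); my $middle = time - $t0;
      printf "front: %.4fs  middle: %.4fs\n", $front, $middle;
      ```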

      The drastic difference in benchmark times confuses me. Update: No, "sub btrott" isn't broken (my debugging was). Hmmm....

      Update 2: Never underestimate the slowness of Perl's opcode machinery. The "map" version loses because it executes 10_000 statements while the "splice" version only executes 100. While each "splice" statement has to copy a big chunk of the 10_000-element array, Perl's overhead swamps that. With the number of elements to be removed raised to 5_000, the "map" version is faster.

              - tye (but my friends call me "Tye")
      Nope. Arrays are not "linked lists". They do not "splice" well, but they do "shift" and "pop" quickly. And they can "unshift" and "push" quickly if the spare (or leftover) memory happens to be in the right place, otherwise, they must be copied.

      -- Randal L. Schwartz, Perl hacker

        That will teach me to trust my memory :)

        mikfire

      Your post illustrates the need to benchmark code carefully! The map solutions are better algorithmically, meaning that as you increase the size of the array and the number of elements removed, btrott's solution will slow down much more than they do. But in the real world the worse algorithm may still work out to be better.

      In this example, that is clearly the case. If you want to experiment, here is some sample code to play with:

      use Benchmark;

      my $n = 10000;
      my $m = 100;
      my @array = 0..$n;
      my @idxsToRemove = map { int( rand( 10000 ) ) } 1..$m;

      my %tests = (
          grepping => sub {
              my @subarray = @array;
              my %toRemoveIdx;
              @toRemoveIdx{@idxsToRemove} = (1) x @idxsToRemove;
              @subarray = @subarray[ grep !$toRemoveIdx{$_}, 0..$#subarray ];
          },
          mapping => sub {
              my @subarray = @array;
              my %toRemoveIdx;
              @toRemoveIdx{@idxsToRemove} = (1) x @idxsToRemove;
              @subarray = map { $toRemoveIdx{$_} ? () : $subarray[$_] } 0..$#subarray;
          },
          sorting => sub {
              my @subarray = @array;
              splice @subarray, $_, 1 for sort { $b <=> $a } @idxsToRemove;
              @subarray;
          },
      );

      timethese( shift @ARGV || -5, \%tests );
      As you move $m and $n up, you will find the first two solutions scaling better. But unless you are using some truly impressive numbers, the sort solution by btrott will be faster.

      (Yes, I know the test could be improved a lot...)

      NOTE: The sorting solution is both faster and incorrect. If your list of elements to remove has duplicates, you will splice at the same index more than once, removing elements that should have stayed! This illustrates my real reason for liking map: I find it easier to figure out and validate all possible gotchas with it, even if it is sometimes a lot slower. :-)
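      The duplicate-index gotcha is easy to demonstrate. Here is a small, self-contained sketch (my addition, not from the original post) using a five-element array with index 1 listed twice:

      ```perl
      # The index list contains a duplicate: index 1 appears twice.
      my @array        = (10, 20, 30, 40, 50);
      my @idxsToRemove = (1, 1);

      # splice-based removal: after the first splice shifts the tail
      # down, the second splice at index 1 deletes a different,
      # innocent element.
      my @by_splice = @array;
      splice @by_splice, $_, 1 for sort { $b <=> $a } @idxsToRemove;
      # @by_splice is now (10, 40, 50) -- two elements gone, not one

      # hash/map-based removal: duplicate indices collapse into one
      # hash key, so only the intended element is removed.
      my %toRemoveIdx;
      @toRemoveIdx{@idxsToRemove} = (1) x @idxsToRemove;
      my @by_map = map { $toRemoveIdx{$_} ? () : $array[$_] } 0..$#array;
      # @by_map is (10, 30, 40, 50)
      ```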

      EDIT
      A typo fix: I noticed that I had changed the code without updating the description of what it did. Oops.