in reply to Re: Difference arrays.
in thread Difference arrays.

In-place sorting would be much faster [ O(N log N) instead of O(N²) ] and can be written to use O(1) extra memory. However, it destroys the original arrays.
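For concreteness, a minimal sketch of that sort-based approach, using the sample data from the reply below and assuming (as elsewhere in the thread) that @b is a subset of @a and each element of @b should cancel one matching copy in @a:

    my @a = ( 42, 42, 43, 43, 43, 44, 45, 46, 41, -13 );
    my @b = ( 43, 45, -13, 43 );

    @a = sort { $a <=> $b } @a;   # destroys the original order of @a
    @b = sort { $a <=> $b } @b;

    # One linear pass: each element of sorted @b cancels one matching
    # copy in sorted @a; survivors are compacted leftward in @a itself.
    my ( $keep, $j ) = ( 0, 0 );
    for my $i ( 0 .. $#a ) {
        if ( $j < @b && $a[$i] == $b[$j] ) {
            ++$j;                       # matched: drop this copy
        }
        else {
            $a[ $keep++ ] = $a[$i];     # keep this copy
        }
    }
    $#a = $keep - 1;                    # truncate to the kept elements

    print join( ', ', @a ), "\n";       # 41, 42, 42, 43, 44, 46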

Also, you're wasting memory by placing @c on the stack. You could drop your memory usage from O(N) to O(1) by compressing @c in-place. Unlike sort, grep isn't optimized to run in place when the source and the destination are the same array.
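To make that contrast concrete (an illustrative sketch, not code from the thread):

    @a = sort @a;               # perl optimizes this to sort @a in place
    @c = grep { defined } @c;   # grep builds its whole result list first:
                                # O(N) extra memory before the assignment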

Update: Added downside.

Re^3: Difference arrays.
by mr_mischief (Monsignor) on Sep 05, 2008 at 09:48 UTC
    I really just posted it as an interesting alternative. The method of marking the array directly was the main focus. I already said it'd run more slowly than some others.

    It's actually not bad when the subset is 32 or so items or fewer, or if @a has lots of duplicates that happen to be in @b. It doesn't lose time to function-call overhead in the tight inner loops.

    The grep is the biggest memory concern, and that's an implementation detail of the language. The original post asked for grep and a hash. I offered an array instead of a hash; that should save some memory by itself. I could splice @c (or @a) in the foreach, but perlsyn specifically forbids that. I could pop off each element and push it back on only if it's defined, as sketched below. That seems like a lot of work in response to a request for a simple solution that could include grep, and I'm sure BrowserUK could figure that part out anyway. Mine's already not the easiest here to understand.
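    One way to realize that take-off-and-put-back idea without splicing inside the foreach is to rotate the array through itself exactly once. A sketch of the idea only, not code from the post:

        for ( 1 .. scalar @c ) {
            my $e = shift @c;            # take one element off the front
            push @c, $e if defined $e;   # put it back only if it's defined
        }
        # After one full rotation @c holds only the defined elements,
        # still in their original relative order.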

    If the memory use issue is due to thousands of small arrays quadrupling in size, then my solution could be useful. If the problem is that the actual production arrays are huge or that the subset arrays are fairly long, then it won't be.

    That solution can also pretty easily be altered so that, by sorting @b first, duplicates within @b don't cause another loop through @a. It's not a giant optimization, but it could slow the growth substantially if the typical data set has lots of duplicates in the subset array. I have no idea how prevalent duplicates within that array actually are.

    my @a = ( 42, 42, 43, 43, 43, 44, 45, 46, 41, -13 );
    my @b = ( 43, 45, -13, 43 );
    my @c = @a;

    @b = sort @b;   # sorting groups any duplicates together

    for my $i ( 0 .. $#b ) {
        # skip a duplicate of the previous subset element
        next if $i > 0 && $b[ $i ] == $b[ $i - 1 ];
        for my $j ( 0 .. $#c ) {
            next unless defined $c[ $j ];
            $c[ $j ] = undef, last if $c[ $j ] == $b[ $i ];
        }
    }

    @c = grep { defined } @c;
    print join ', ', @c;
    print "\n";

    BTW, why do you use a for loop to push the elements of @a onto @c? Why not push @c, @a; instead? Is that a memory optimization peculiar to how perl handles push with a list or array argument internally?

      If the memory use issue is due to thousands of small arrays quadrupling in size, then my solution could be useful.

      My point was that it's only useful if it's required for the original arrays to remain unchanged.

      I could splice @c (or @a) in the foreach,

      That would be extremely slow [ probably O(N²) ] compared to the following [ O(N) ]:

      # @c = grep defined, @c;   # is O(N) speed, O(N) memory.
      #
      # The following does the same, but
      # is O(N) speed, O(1) memory.

      my $last = -1;
      my $i    = -1;
      while ( ++$i < @c ) {
          if ( defined( $c[$i] ) ) {
              $c[ ++$last ] = $c[$i];
          }
      }

      $#c = $last;

      If the last line isn't O(1) memory, replace it with

      pop @c while $#c > $last;

      BTW, why do you use a for loop to push the elements of @a onto @c? Why not push @c, @a; instead?

      push uses O(N) memory (where N is the number of elements pushed).
      for ARRAY (as opposed to for LIST) is optimized to use O(1) memory, I think.
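      A sketch of the two forms being compared (illustrative only, reconstructing the for-loop form from the description above):

          # One-shot push: all of @a is flattened onto the argument
          # stack before push runs, hence the O(N) extra memory.
          push @c, @a;

          # 'for ARRAY' aliases $_ to each element in place, so the
          # loop needs only O(1) extra memory.
          push @c, $_ for @a;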