BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

Given two arrays (with duplicates), one a proper subset of the other, produce a third array containing those in the first not in the second. Eg.

my @a = (43,43,44); my @b = (43,43); ## Required result my @c = (44); my @p = ( 1,1,1,1,1,2,2,2,3,3,4,5,6); my @q = ( 1,2,3,4,5,6 ); ## Required result my @r = ( 1,1,1,1,2,2,3 );

The FAQ solution produces both intersection and difference, and more than doubles memory usage in the process.

List::Compare more than quadruples memory usage on it's way to calculating about fifty things I don't need as well as the one I do.

Most of the other CPAN and Categorised answer solutions don't handle duplicates, use large amounts of memory, or both.

Is there a simple variation on a using a hash and grep that works? Cos I'm not seeing it written anywhere and its eluding me.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Replies are listed 'Best First'.
Re: Difference arrays.
by betterworld (Curate) on Sep 04, 2008 at 19:30 UTC
    my @p = (1,1,1,1,1,2,2,2,3,3,4,5,6); my @q = (1,2,3,4,5,6 ); # my @p = (43, 43, 44); # my @q = (43, 43); my %q; $q{$_}++ for @q; my @r = grep { --$q{$_} < 0; } @p; print join (',', @r), "\n";

      Similar idea like mine above , but much, much better than I did it! ++ and ++ could I vote twice!


      s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
      +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e

      Big bug

      Just found it when I was examining kyle's solution.

      swap p and q and your routine will fail!

      Sorry! Not a bug! My mistake!


      s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
      +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e

        I noticed that your solution is symmetric because it has the abs in it. However I was taking BrowserUk by the word: "those in the first not in the second".

        Given the restriction "one a proper subset of the other", it does not matter, and our solutions print essentially the same elements (maybe in a different order).

Re: Difference arrays.
by moritz (Cardinal) on Sep 04, 2008 at 19:29 UTC
    Here's a not-quite-as-simple version that doesn't use a hash at all, and as little additional memory as possible. It exploits both the fact that one array is a proper subset of the other, and that the items appear in the same order in both arrays.

    If the latter is not the case you'd have to sort it, which kinda defeats the memory advantage.

    use strict; use warnings; my @p = ( 1,1,1,1,1,2,2,2,3,3,4,5,6); my @q = ( 1,2,3,4,5,6 ); my ($px, $qx) = (0, 0); my @diff; while (1) { if ($qx >= @q){ push @diff, @p[$px .. @p-1]; last; } elsif ( $p[$px] == $q[$qx] ) { $px++; $qx++; } else { push @diff, $p[$px++]; } } print "p: @p\n"; print "q: @q\n"; print "d: @diff\n";

    Update: that code can be simplified a bit:

    while ($qx < @q) { if ( $p[$px] == $q[$qx] ) { $px++; $qx++; } else { push @diff, $p[$px++]; } } push @diff, @p[$px .. @p-1];
Re: Difference arrays.
by ikegami (Patriarch) on Sep 04, 2008 at 19:58 UTC

    Since you're concerned about memory, you could do something like a Merge Sort.
    Memory: O(1) (Not counting @a, @b and @c)
    Speed: O(A+B) (Assuming @a and @b already sorted. As good as the other solutions)

    my @a = sort { $a <=> $b } (43,43,44); my @b = sort { $a <=> $b } (43,43); my @c; while (@a && @b) { if ($a[0] < $b[0]) { push @c, shift @a; } elsif ($a[0] > $b[0]) { die "Bad data"; } else { shift @a; shift @b; } } push @c, $_ for @a; die "Bad data" if @b;

    A trivial change makes it non-destructive.

    Update: Fixed bug mentioned in replies. Tested.

      That works better if the pops are shifts. .oO(How many times have I been caught by that!)


      Perl reduces RSI - it saves typing

      When I run this, I end up with...

      @a = ( 43 ); @b = (); @c = ( 43 );

      None of those is the OP's desired result, ( 44 ). Am I missing something?

Re: Difference arrays.
by kyle (Abbot) on Sep 04, 2008 at 19:52 UTC

    Fun with DDT!

    My own solution pulled out of the <readmore>:

    sub kyle { my ( $ref1, $ref2 ) = @_; my %h; $h{$_}++ for @{$ref1}; $h{$_}-- for @{$ref2}; my %x; return [ grep { $x{$_}++ < $h{$_} } @{$ref1} ]; }

    I notice that Skeeve's has the bug mine had before I tested it.

    I'm happily surprised at how many don't have the bug I was expecting when I wrote the [ $r1, ... ] test.

    I, for one, wasn't shooting to optimize memory usage. I just wrote the first thing I thought of.

      Okay… Mine has the bug that it stringifies the keys. agreed.

      yours has a bug too! It doesn't work in this case:

      kyle( \@q, \@p );

      No! Mine has the bug in that it doesn't care for what BrowserUK asked for!


      s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
      +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e

        (OK, so as I was writing this, the node I'm replying to was updated. I'll post anyway...)

        I don't know what you mean. You've passed in the input arrays in the opposite order?

        The OP specifies that the second array is to be a proper subset of the first. If you reverse them, that violates this condition. So what would you expect to get back in that case? Mine returns (a reference to) an empty array. The solution from betterworld gets the same thing. The two solutions that moritz posted go into an infinite loop (apparently—I didn't exactly wait that long). The only other working solution, pjotrik's, gets something else.

        Anyway, here's an updated test script...

Re: Difference arrays.
by Skeeve (Parson) on Sep 04, 2008 at 19:28 UTC

    Since it's you, here is my suggestion:

    Update: My "solution" is now shamefully hidden in readmore-tags because it doesn't give you what was asked for. My solution simply gives you (stringified) all elements whch are in the first or the second array but not in both.

    Wouldn't it be you, I would have cried H O M E W O R K! ;-)

    Update: Removed an overseen "my $d". Thanks to betterworld


    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
      "Examine what is said, not who speaks"

        "Examine what is said, not who speaks"

        Hey, I think I've seen that in someone's signature! Um, anyway...

        I can't speak for the other monks, but I've found content analysis to be expensive enough to be worth optimizing. As such, it's expedient to apply a simple memoization/caching technique whereby the source of some assertion influences the time and effort I spend evaluating the assertion based on past experiences with that source.

        It's true that a stopped clock is right twice every day, and even kooks can have insights worthy of deep consideration. Still, I don't want to waste my time looking at a clock I already know is broken any more than I want to waste my time pondering the rantings of someone whose rantings I've already pondered at a length greater than their value.

        I'm not talking about anyone in particular here. I just find the "ignore the source" meme a little irritating.

        Heuristics have their place.

Re: Difference arrays.
by pjotrik (Friar) on Sep 04, 2008 at 20:22 UTC
    Given the (very strict) restrictions (proper subset, sorted) , my solution would be:
    sub pjotrik { my ($a, $b) = @_; my $i = 0; return [ map { if ($i < @$b && $_ == $$b[$i]) { $i++; () } else { +$_ } } @$a ]; }
    But the use of == makes it somewhat vulnerable. ~~ should improve that, but I have no experience with it.

    UPDATE: Note that as well as ikegami's solution, this is based on the idea of merging.

Re: Difference arrays.
by pat_mc (Pilgrim) on Sep 04, 2008 at 22:34 UTC
    Not sure if there is much point in submitting yet another solution. Still, here's my take on things:
    my @a = ( 1,1,1,1,1,2,2,2,3,3,4,5,6); my @b = ( 1,2,3,4,5,6 ); my %d; my @difference; $d{ $_ } ++ for @a; $d{ $_ } -- for @b; for my $key ( keys %d ) { next if ( $d{ $key } <= 0 ); push @difference, $key for ( 1 .. $d{ $key } ); } print sort @difference;


    It is perspicuous and works. Comments?

    Pat
Re: Difference arrays.
by dreadpiratepeter (Priest) on Sep 04, 2008 at 19:18 UTC
    UPDATE: scratch that, didn't read close enough and look at the second example, my bad.
    wouldn't (off the top of my head):
    my @a = (43,43,44); my @b = (43,43); my %h = map {($_=>1)} @a; delete @h{@b}; print join(",",keys %h);
    work?


    -pete
    "Worry is like a rocking chair. It gives you something to do, but it doesn't get you anywhere."

      work? No!

      Try it with the other example of the question.


      s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
      +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
Re: Difference arrays.
by repellent (Priest) on Sep 05, 2008 at 02:38 UTC
    Perhaps I'm missing something, but here's what I got:
    my @a = ( 1,1,1,1,1,2,2,2,3,3,4,5,6); my @b = ( 1,2,3,4,5,6 ); my @c = map { my $found = 0; my $m = $_; for (1 .. @b) { my $n = shift @b; ++$found and last if $m == $n; push @b, $n; } $found ? () : $m; } @a;

    Update: OK, I think I got it to work. No hash here, so it may save more space at the expense of time.
Re: Difference arrays.
by mr_mischief (Monsignor) on Sep 05, 2008 at 05:36 UTC
    This should trade lower memory usage for potentially longer run times before accounting for swapping effects.

    This version assumes one can modify @a for even a bit more memory savings:

    my @a = ( 42, 42, 43, 43, 43, 44, 45, 46 ); my @b = ( 43, 45 ); for my $i ( 0 .. $#b ) { for my $j ( 0 .. $#a ) { next unless defined $a[ $j ]; $a[ $j ] = undef, last if $a[ $j ] == $b[ $i ]; } } @a = grep { defined } @a; print join ', ', @a; print "\n";
    outputs: 42, 42, 43, 43, 44, 46

    This obviously trivial modification is more conservative and assumes one can't modify @a:

    my @a = ( 42, 42, 43, 43, 43, 44, 45, 46 ); my @b = ( 43, 45 ); my @c = @a; for my $i ( 0 .. $#b ) { for my $j ( 0 .. $#c ) { next unless defined $c[ $j ]; $c[ $j ] = undef, last if $c[ $j ] == $b[ $i ]; } } @c = grep { defined } @c; print join ', ', @c; print "\n";

    The second outputs the same as the first. Neither cares if the arrays are presorted, because it's O(m*n) and checking each against each already.

      In-place sorting would be much faster [ O(N log N) instead of O(N2) ] and can be be written to use O(1) extra memory. However, it destroys the original arrays.

      Also, you're wasting memory by placing @c on the stack. You could drop your memory usage from O(N) to O(1) by compressing @c in-place. grep doesn't work in-place like sort when the source and destination is the same.

      Update: Added downside.

        I really just posted it as an interesting alternative. The method of marking the array directly was the main focus. I already said it'd run more slowly than some others.

        It's actually not bad where the subset is 32 or so items or fewer, or if @a has lots of duplicates that happen to be in @b. It doesn't slow down from function calls in the tightly wound sections.

        The grep is the biggest memory concern, and that's an implementation detail of the language. The original post asked for grep and a hash. I offered an array instead of a hash. That should save some memory by itself. I could splice @c (or @a) in the foreach, but perlsyn specifically forbids that. I could pop off each element and push it back on only if it's defined. That seems like a lot of work in response to a request of a simple solution which could include grep, and I'm sure BrowserUK could figure that part out anyway. Mine's already not the easiest here to understand.

        If the memory use issue is due to thousands of small arrays quadrupling in size, then my solution could be useful. If the problem is that the actual production arrays are huge or that the subset arrays are fairly long, then it won't be.

        That solution can also pretty easily be altered so that by sorting only @b any duplicates within @b do not cause a loop through @a again. It's not a giant optimization, but it could slow the growth substantially if the typical data set has lots of duplicates in the subset array. I have no idea how prevalent duplicates within that array actually are.

        my @a = ( 42, 42, 43, 43, 43, 44, 45, 46, 41, -13 ); my @b = ( 43, 45, -13, 43 ); my @c = @a; @b = sort @b; for my $i ( 0 .. $#b ) { next if $i > 0 && $b[ $i ] == $b[ $i - 1]; for my $j ( 0 .. $#c ) { next unless defined $c[ $j ]; $c[ $j ] = undef, last if $c[ $j ] == $b[ $i ]; } } @c = grep { defined } @c; print join ', ', @c; print "\n";

        BTW, why do you use a for loop to push the elements of @a onto @c? Why not push @c, @a; instead? Is that a memory optimization peculiar to how perl handles push with a list or array argument internally?

Re: Difference arrays.
by Anonymous Monk on Sep 05, 2008 at 05:16 UTC