Ya has asked for the wisdom of the Perl Monks concerning the following question:


I have two arrays. I want to splice out specific entries from @arr1 which are present in @arr2.
How do I do the same if I have duplicate entries in @arr1?
ex:

    @arr1 = (1, 22, 3, 52, 25, 3, 6, 8, 9, 10);
    @arr2 = (25, 52, 8);
    ...
    # the result I need: @arr1 should now be (1, 22, 3, 3, 6, 9, 10)

Currently I am using a foreach over @arr2, but for large arrays (28000 entries and more) it is taking a long time.
Thanks & Regards,
Ya

Replies are listed 'Best First'.
Re: How to splice out multiple entries from an array
by hossman (Prior) on Oct 24, 2002 at 07:28 UTC
    1. What you are looking for is called a "set difference".
    2. This is covered in perlfaq4.
    3. If you run perldoc -q difference, perldoc will search for FAQ entries that refer to that keyword.
    4. Here is what the FAQ has to say...

      Found in /usr/share/perl/5.6.1/pod/perlfaq4.pod

      How do I compute the difference of two arrays? How do I compute the intersection of two arrays?

      Use a hash. Here's code to do both and more. It assumes that each element is unique in a given array:

          @union = @intersection = @difference = ();
          %count = ();
          foreach $element (@array1, @array2) { $count{$element}++ }
          foreach $element (keys %count) {
              push @union, $element;
              push @{ $count{$element} > 1 ? \@intersection : \@difference }, $element;
          }

      Note that this is the symmetric difference, that is, all elements in either A or in B but not in both. Think of it as an xor operation.

    5. "large inputs" tend to take a "long time" to process. Take that as a given in life -- something that's implied by the laws of physics

      The only real caveat is that sometimes, when dealing with large inputs, the nature of the problem lends itself to a solution in which you can get intermediate results as you go, without having to first see all of the input (i.e. find all lines of input that match a certain pattern).

      Set operations do not fall into that category of problem -- you pretty much have to have both sets, in their entirety, available to you at the same time in order to find the difference (or intersection, or union)

      Actually, the above algorithm is not particularly suitable for the OP's needs. It's relevant as an indicator of what to do, but it doesn't solve his problem in the most efficient way.

      If you read the code, and assume that @arr1 is large, without many duplicates, and that @arr2 is small without many duplicates either, then it would appear that the above will iterate over the values of @arr1 twice. The above also has the problem that the resulting array will have no ordering relationship to the original one. I think a superior method is the one I posted in my orphaned reply, or rather a minor variant that doesn't use hash slices.

      my %remove;
      $remove{$_}++ for @arr2;
      my @clean = grep !$remove{$_}, @arr1;

      This code will walk @arr1 and @arr2 only once each. Note that this is one of the weaknesses of 'big O notation': both of these solutions are O(N+M), but their worst cases are 2(N+M) and N+M respectively, so the latter has a much better worst case even though the two are the same order.
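      A runnable sketch of that grep-based difference, using the OP's sample data (variable names follow the snippet above); note that it preserves the order of @arr1 and keeps its duplicates, which is what the OP asked for:

```perl
use strict;
use warnings;

# Hash the exclusion list once, then filter @arr1 in a single pass.
my @arr1 = (1, 22, 3, 52, 25, 3, 6, 8, 9, 10);
my @arr2 = (25, 52, 8);

my %remove;
$remove{$_}++ for @arr2;                  # one pass over @arr2
my @clean = grep !$remove{$_}, @arr1;     # one pass over @arr1

print "@clean\n";    # prints "1 22 3 3 6 9 10"
```

The duplicate 3 in @arr1 survives, and ordering is untouched, because grep simply passes through every element not found in the hash.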

      BrowserUk posted a benchmark that I will mod to benchmark the various solutions, including the one above.

      Regards,

      --- demerphq
      my friends call me, usually because I'm late....

      Thanks for your reply.

      Best Regards,

      Ya
Re: How to splice out multiple entries from an array
by BrowserUk (Patriarch) on Oct 24, 2002 at 12:33 UTC

    If you need to generate the intersection between your array1 and another array more than once, then there is a way of speeding this process up by a huge factor (over 10000%!) relative to using the FAQ intersection method.

    Instead of generating a new hash from arrays 1 & 2 each time you need to do this, you can use a %hash1 in place of @array1. In use, you would need to change your array1 references from $array1[n] to $aryHash1{n}; you wouldn't be able to have duplicates in %aryHash1, and it would use additional memory, but less than the combined space of an array and a temporary hash for doing the intersection.
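    A minimal sketch of that idea, with the hash name following the post and the OP's data (duplicates collapse, since a hash can't hold them, and ordering is lost):

```perl
use strict;
use warnings;

# Keep the lookup data in a hash from the start, instead of in @arr1.
# The values are just placeholders; only the keys matter.
my %aryHash1 = map { $_ => 1 } (1, 22, 3, 52, 25, 3, 6, 8, 9, 10);
my @arr2     = (25, 52, 8);

# The whole removal is one builtin call operating on a hash slice:
delete @aryHash1{@arr2};

print join(' ', sort { $a <=> $b } keys %aryHash1), "\n";   # 1 3 6 9 10 22
```

Because delete works on the hash slice directly, no temporary hash is built and no application-level loop over the data is needed.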

    Benchmark

    By my calculation, this is 1037/8.81*100 = 11770% faster! Did I make a mistake?


    Cor! Like yer ring! ... HALO dammit! ... 'Ave it yer way! Hal-lo, Mister la-de-da. ... Like yer ring!
      Hmm, it's a little hard to benchmark your approach against the others. (That's why you used Benchmark::Timer, right?) Anyway, I tried a couple of interpretations. Considering that all of the others had initialization overhead involved, I decided that your approach needed it as well. I realize that this is not entirely fair, considering that your approach would not incur this overhead under some designs. However, the same argument applies to the other solutions to a certain degree. The auxiliary hashes used by the FAQ solution could be precomputed when reading the array, for example.

      Despite the difficulties I gave it a crack (using two different ways to set up the hash-based solution) and this is what I came up with:

      __END__
      Benchmark: timing 100 iterations of faq, grepped, hash, hash_lret, prehash, prehash_lret...
               faq: 35 wallclock secs (34.09 usr +  0.02 sys = 34.11 CPU) @  2.93/s (n=100)
           grepped: 11 wallclock secs ( 9.91 usr +  0.00 sys =  9.91 CPU) @ 10.09/s (n=100)
              hash: 10 wallclock secs (10.27 usr +  0.00 sys = 10.27 CPU) @  9.74/s (n=100)
         hash_lret: 10 wallclock secs (10.23 usr +  0.00 sys = 10.23 CPU) @  9.77/s (n=100)
           prehash: 16 wallclock secs (15.13 usr +  0.00 sys = 15.13 CPU) @  6.61/s (n=100)
      prehash_lret: 16 wallclock secs (15.17 usr +  0.00 sys = 15.17 CPU) @  6.59/s (n=100)

                      Rate   faq prehash_lret prehash  hash hash_lret grepped
      faq           2.93/s    --         -56%    -56%  -70%      -70%    -71%
      prehash_lret  6.59/s  125%           --     -0%  -32%      -33%    -35%
      prehash       6.61/s  126%           0%      --  -32%      -32%    -35%
      hash          9.74/s  232%          48%     47%    --       -0%     -4%
      hash_lret     9.77/s  233%          48%     48%    0%        --     -3%
      grepped       10.1/s  244%          53%     53%    4%        3%      --
      The _lret results are the ones that actually return a list, and the prehash vs hash variants differ in how the variables are initialized. The results, I think, are surprising.

      Incidentally, unless Benchmark::Timer is timing _long_ events it will lose accuracy, perhaps considerably. On most boxes the timer resolution is around 1/100th of a second. That is why Benchmark is written the way it is.

      This is one of those things that makes benchmarking different abstract solutions against each other very difficult. Ultimately the only real way to benchmark is to use a live solution and time it against a different design under the same environment.

      --- demerphq
      my friends call me, usually because I'm late....

        Thats why you used Benchmark::Timer right?

        The reason I used Benchmark::Timer is because it allowed me to time the deletion process, rather than incorporating the overhead of rebuilding the data for each pass. This rebuilding process is only required, because what we are trying to measure is a destructive process and we are repeating it many times for the purposes of benchmarking. It would not be a part of a real application, so should not be allowed to influence the timing of the code under test.

        You may consider that incorporating the rebuilding process into all the tests makes for a valid comparison, but it doesn't. When the code required ONLY for the purposes of testing takes orders of magnitude longer than the code you are actually trying to measure, it has the effect of swamping the data you are interested in, to the point where the information gleaned is useless.

        Given algorithm A taking 1.1 time units, algorithm B taking 0.5 time units, and a test setup overhead of 100 time units:

        ( (100 + 1.1) / (100 + 0.5) ) * 100 = 100.597%

        This makes it look like Algorithm B is only 0.6% better than Algorithm A.

        Now remove the testing overhead from the equation:

        ( 1.1 / 0.5 ) * 100 = 220%

        Algorithm A is 120% slower than Algorithm B. Or:

        ( 0.5 / 1.1 ) * 100 = 45%

        Algorithm B takes less than half the time that Algorithm A took.
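        The arithmetic above, sketched in code (the timings are the hypothetical numbers from the example, not measurements):

```perl
use strict;
use warnings;

# A fixed setup overhead of 100 units swamps a genuine 2.2x difference
# between the algorithms down to an apparent 0.6%.
my ($time_a, $time_b, $overhead) = (1.1, 0.5, 100);

my $apparent = ($overhead + $time_a) / ($overhead + $time_b) * 100;
my $isolated = $time_a / $time_b * 100;

printf "apparent: %.3f%%  isolated: %.0f%%\n", $apparent, $isolated;
# apparent: 100.597%  isolated: 220%
```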

        It is very difficult to prevent this inclusion of artificial overhead in the timings using Benchmark::cmpthese.

        However, I have achieved this by using a fresh, preallocated lookup array or hash for each iteration of the benchmark, thereby removing the need to incorporate the rebuilding of the destructively modified data into the benchmark timings. Hopefully, this test methodology will meet with your approval.

        Considering that all of the others had initialization overhead involved, I decided that your approach needed it as well. I realize that this is not entirely fair, considering that your approach would not incur this overhead under some designs. However, the same argument applies to the other solutions to a certain degree.

        This is not just 'not entirely fair', it's wrong. As stated in the post with my original benchmark, if the application is only using the array as a lookup table, and doesn't need to access the array elements chronologically by index, then using a hash in place of the array (as opposed to as well as the array) facilitates the use of a hash slice and the delete function to do the intersection. This removes any overhead associated with building a temporary hash to achieve the intersection, and means that the entire process of iteration over the exclusion set can be performed in machine code rather than at application level. It also removes the need to regenerate the lookup table, as the delete function modifies the hash directly. With both the other methods, the building of the temporary hash is fundamental to their function, as is iteration over the lookup table at application level.

        It is the very removal of these costly factors, and the performing of the intersection atomically (from the POV of the application) at machine-code level, that is responsible for the huge hike in efficiency.

        This is exactly what the OP was asking for. If the OP's application did not fit these requirements, then the technique would not be valid for his purpose, but as we do not have any knowledge of those requirements, I offered the benchmark so that he could make that judgment for himself.

        The problem with your benchmark is that, as well as incorporating the testing overhead as described above, you imposed the generation of a new hash every time. This wouldn't be necessary, as the hash would always exist. In fact it would be the only place where the lookup table would be stored. If you re-read my post you'll see that I say "%hash1 in place of @array1". In a second version you then also imposed the generation of an array from the keys of the hash. Why? This would never be used, so there is no need to generate it.

        Given a large list of values to be compared against, and an array of things to look up, using the other methods requires two steps to complete the process.

        1. Build a hash from the array of things to look up.
        2. Filter the list against the hash.

        The difference is exemplified by the case where a second array of values needs to be excluded from the big list. Using the hash&grep method, this second array needs to be hashed. Using delete, the array can be used directly. The overhead of building the temporary hash is an inherent part of the hash and grep algorithm. By maintaining the lookup table in a hash instead of an array, it facilitates using delete on a hash slice with the added bonus that the construction of the temporary hash is unnecessary.
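        A sketch of that two-exclusion-list case (the data and names here are illustrative, not from the thread): with the lookup table held in a hash, each new exclusion array is applied directly via delete on a hash slice, with no temporary hash built for either list.

```perl
use strict;
use warnings;

# Lookup table kept as a hash from the start; values are placeholders.
my %lookup    = map { $_ => 1 } 1 .. 10;
my @exclude_a = (2, 4);
my @exclude_b = (6, 8);

delete @lookup{@exclude_a};   # first exclusion list, used as-is
delete @lookup{@exclude_b};   # second list, also used as-is

print join(' ', sort { $a <=> $b } keys %lookup), "\n";   # 1 3 5 7 9 10
```

With the hash-and-grep method, each of @exclude_a and @exclude_b would first have to be turned into its own temporary hash before filtering.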

        Of course, this method will not work if other parts of the program need to access the individual values of either the pre- or post-removal lists in their original ordering. As was discussed elsewhere recently, the need to access arrays by index is actually a fairly rare occurrence in perl. This is even more true here, as the indexes will have changed after the deletion anyway, although the relative ordering will remain.

        Using Benchmark::cmpthese to compare the two algorithms (I dropped the FAQ version as we both agree that it is wildly inefficient, serving only as an aid to understanding), having isolated the code performing the intersection from code only used to facilitate testing, I arrived at these timings.

                         Rate grepmethod deletemethod
        grepmethod     1.99/s         --         -99%
        deletemethod    270/s     13432%           --

        The full output is shown at the bottom of the test program below, and you will notice that the testing of the delete method is reported as not having been run enough times for accurate measurement. The method is so efficient that it is impossible on my box with 256MB ram to create enough test data to allow it to run enough times to prevent this warning. Whilst this may allow the 13000% better performance figure to be challenged, the fact that it is at least 2 orders of magnitude more efficient cannot be. I've now benchmarked this four different ways and the results are always the same, at least 10,000% or 2-orders of magnitude quicker.

        Code and full benchmark results.


        Nah! You're thinking of Simon Templar, originally played by Roger Moore and later by Ian Ogilvy.
Re: How to splice out multiple entries from an array
by demerphq (Chancellor) on Oct 24, 2002 at 08:49 UTC
    This is a FAQ. The basic solution is to put the elements of @arr2 in a hash and then use grep. (If you can arrange for @arr2 to be a hash from the get-go then you are one up on this.)

        my %hash;
        @hash{@arr2} = (1) x @arr2;
        my @cleaned = grep !$hash{$_}, @arr1;

    The running time of this is N+M, i.e. the number of elements in both arrays summed together.

    Here is the relevant FAQ (in fact it explains how to do array union, intersection, and difference, the above only does the difference.)

    In future please try doing some rudimentary searching before asking a question. Especially for something simple like this. It almost certainly has been asked and answered time and again.

    HTH

    --- demerphq
    The trick is to use Perl's strengths, not its weaknesses -- Larry Wall

Re: How to splice out multiple entries from an array
by gjb (Vicar) on Oct 24, 2002 at 08:26 UTC

    If the order of the list is not important, you can use Set::Scalar to do the job.

    use Set::Scalar;

    my @arr1 = (1, 22, 3, 52, 25, 3, 6, 8, 9, 10);
    my @arr2 = (25, 52, 8);

    my $set = Set::Scalar->new(@arr1);
    @arr1 = $set->delete(@arr2)->members();

    A solution for the truly lazy, hope this helps, -gjb-

Re: How to splice out multiple entries from an array
by hotshot (Prior) on Oct 24, 2002 at 07:16 UTC
    Maybe you should use a third array and just push onto it all the entries of @arr1, as long as they don't appear in @arr2 (you can use a hash for that purpose). It's O(n) complexity at most.
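    A sketch of that suggestion on the OP's data, using a hash for the membership test as hinted (variable names are illustrative):

```perl
use strict;
use warnings;

# Push each element of @arr1 onto a result array unless it is in @arr2.
my @arr1 = (1, 22, 3, 52, 25, 3, 6, 8, 9, 10);
my @arr2 = (25, 52, 8);

my %in_arr2 = map { $_ => 1 } @arr2;   # O(M) to build
my @result;
for my $elem (@arr1) {                 # O(N) single pass
    push @result, $elem unless $in_arr2{$elem};
}

print "@result\n";   # prints "1 22 3 3 6 9 10"
```

This is the explicit-loop equivalent of the grep solutions elsewhere in the thread; it also preserves order and duplicates in @arr1.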

    Hotshot