in reply to Re: How to splice out multiple entries from an array
in thread How to splice out multiple entries from an array

Hmm, it's a little hard to benchmark your approach against the others. (That's why you used Benchmark::Timer, right?) Anyway, I tried a couple of interpretations. Considering that all of the others had initialization overhead involved, I decided that your approach needed it as well. I realize that this is not entirely fair, considering that your approach would not incur this overhead under some designs. However, the same argument applies to the other solutions to a certain degree: the auxiliary hashes used by the faq solution could be precomputed when reading the array, for example.

Despite the difficulties I gave it a crack (using two different ways to set up the hash-based solution), and this is what I came up with:

#! perl -w
use strict;
use Benchmark qw(cmpthese);

sub intersect (\@\@) {
    my ($aryRef1, $aryRef2) = @_;
    my %count;
    $count{$_}++ foreach @$aryRef1, @$aryRef2;
    return grep { $count{$_} == 1 } keys %count;
}

my @array1 = (1 .. 28000);
my @array2 = (1 .. 1000, 1 .. 100);   # create some duplicates with duplicates

my %hashBase;
@hashBase{@array1} = undef;

sub faq {
    my @dup = intersect( @array1, @array2 );
}

sub hash {
    my %hashAry;
    @hashAry{@array1} = undef;
    my @dup = delete @hashAry{@array2};
}

sub prehash {
    my %hashAry = %hashBase;
    my @dup = delete @hashAry{@array2};
}

sub hash_lret {
    my %hashAry;
    @hashAry{@array1} = undef;
    my @dup = delete @hashAry{@array2};
    return keys %hashAry;
}

sub prehash_lret {
    my %hashAry = %hashBase;
    my @dup = delete @hashAry{@array2};
    return keys %hashAry;
}

sub grepped {
    my %reject;
    $reject{$_} = 1 for @array2;
    my @clean = grep !$reject{$_}, @array1;
}

cmpthese 100, {
    faq          => \&faq,
    hash         => \&hash,
    prehash      => \&prehash,
    hash_lret    => \&hash_lret,
    prehash_lret => \&prehash_lret,
    grepped      => \&grepped,
};
__END__
Benchmark: timing 100 iterations of faq, grepped, hash, hash_lret, prehash, prehash_lret...
         faq: 35 wallclock secs (34.09 usr +  0.02 sys = 34.11 CPU) @  2.93/s (n=100)
     grepped: 11 wallclock secs ( 9.91 usr +  0.00 sys =  9.91 CPU) @ 10.09/s (n=100)
        hash: 10 wallclock secs (10.27 usr +  0.00 sys = 10.27 CPU) @  9.74/s (n=100)
   hash_lret: 10 wallclock secs (10.23 usr +  0.00 sys = 10.23 CPU) @  9.77/s (n=100)
     prehash: 16 wallclock secs (15.13 usr +  0.00 sys = 15.13 CPU) @  6.61/s (n=100)
prehash_lret: 16 wallclock secs (15.17 usr +  0.00 sys = 15.17 CPU) @  6.59/s (n=100)

               Rate  faq prehash_lret prehash  hash hash_lret grepped
faq          2.93/s   --         -56%    -56%  -70%      -70%    -71%
prehash_lret 6.59/s 125%           --     -0%  -32%      -33%    -35%
prehash      6.61/s 126%           0%      --  -32%      -32%    -35%
hash         9.74/s 232%          48%     47%    --       -0%     -4%
hash_lret    9.77/s 233%          48%     48%    0%        --     -3%
grepped      10.1/s 244%          53%     53%    4%        3%      --
The _lret results are the ones that actually return a list, and the prehash vs. hash difference has to do with how the variables are initialized. The results, I think, are surprising.

Incidentally, unless Benchmark::Timer is timing _long_ events it will lose accuracy, perhaps considerably. On most boxes the timer resolution is around 1/100th of a second; that is why Benchmark is written the way it is.

This is one of those things that makes benchmarking different abstract solutions against each other very difficult. Ultimately the only real way to benchmark is to take a live solution and time it against a different design in the same environment.

--- demerphq
my friends call me, usually because I'm late....

Replies are listed 'Best First'.
Re: Re: Re: How to splice out multiple entries from an array
by BrowserUk (Patriarch) on Oct 27, 2002 at 15:28 UTC

    That's why you used Benchmark::Timer, right?

    The reason I used Benchmark::Timer is that it allowed me to time the deletion process itself, rather than incorporating the overhead of rebuilding the data for each pass. This rebuilding is only required because what we are trying to measure is a destructive process and we are repeating it many times for the purposes of benchmarking. It would not be part of a real application, so it should not be allowed to influence the timing of the code under test.
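
    To make the distinction concrete, here is a minimal sketch of the pattern (the data sizes and the 'delete' tag are illustrative, not taken from my actual test code):

        use Benchmark::Timer;

        my $t = Benchmark::Timer->new;

        for ( 1 .. 100 ) {
            # Rebuild the data OUTSIDE the timed region...
            my %lookup;
            @lookup{ 1 .. 28000 } = undef;

            # ...and time ONLY the destructive operation.
            $t->start('delete');
            delete @lookup{ 1 .. 1000 };
            $t->stop('delete');
        }

        print $t->report;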

    You may consider that incorporating the rebuilding process into all the tests makes for a valid comparison, but it doesn't. When the code required ONLY for the purposes of testing takes orders of magnitude longer than the code you are actually trying to measure, it swamps the data you are interested in, to the point where the information gleaned is useless.

    Given algorithm A taking 1.1 time units, algorithm B taking 0.5 time units, and a test setup overhead of 100 time units:

    ( (100 + 1.1) / (100 + 0.5) ) * 100 = 100.597%
    This makes it look like Algorithm B is only 0.6% better than Algorithm A.

    Now remove the testing overhead from the equation.

    ( 1.1 / 0.5 ) * 100 = 220%
    Algorithm A is 120% slower than Algorithm B. Or:

    ( 0.5 / 1.1 ) * 100 = 45%
    Algorithm B takes less than half the time that Algorithm A took.
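
    The same arithmetic as a quick Perl check (the 100, 1.1 and 0.5 figures are the hypothetical ones above):

        my ( $setup, $A, $B ) = ( 100, 1.1, 0.5 );
        printf "with overhead:    %.3f%%\n", 100 * ( $setup + $A ) / ( $setup + $B );   # 100.597%
        printf "without overhead: %.0f%%\n", 100 * $A / $B;                             # 220%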

    It is very difficult to prevent this inclusion of artificial overhead in the timings using Benchmark::cmpthese.

    However, I have achieved this by using a fresh, preallocated lookup array or hash for each iteration of the benchmark, thereby removing the need to incorporate the rebuilding of the destructively modified data into the benchmark timings. Hopefully this test methodology will meet with your approval.

    Considering that all of the others had initialization overhead involved, I decided that your approach needed it as well. I realize that this is not entirely fair, considering that your approach would not incur this overhead under some designs. However, the same argument applies to the other solutions to a certain degree.

    This is not just 'not entirely fair', it's wrong. As stated in the post with my original benchmark, if the application is only using the array as a lookup table, and doesn't need to access the array elements by index, then using a hash in place of the array (as opposed to as well as the array) allows a hash slice and the delete function to perform the intersection. This removes any overhead associated with building a temporary hash to achieve the intersection, and means that the entire iteration over the exclusion set can be performed in machine code rather than at application level. It also removes the need to regenerate the lookup table, as the delete function modifies the hash directly. With both the other methods, building the temporary hash is fundamental to their function, as is iterating over the lookup table at application level.
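
    A minimal sketch of what I mean (the names and data are illustrative only):

        # The lookup table lives in a hash from the start;
        # only the keys matter, so the values can be undef.
        my %lookup;
        @lookup{ 1 .. 28000 } = undef;

        my @exclude = ( 1 .. 1000 );

        # One hash-slice delete removes every excluded key at
        # machine-code level: no temporary hash, no application-level
        # loop, and %lookup is modified in place so nothing needs
        # rebuilding afterwards.
        delete @lookup{ @exclude };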

    It is the very removal of these costly factors, and performing the intersection as a single atomic operation (from the point of view of the application) at machine-code level, that is responsible for the huge hike in efficiency.

    This is exactly what the OP was asking for. If the OP's application did not fit these requirements, then the technique would not be valid for his purpose; but as we do not have any knowledge of those requirements, I offered the benchmark so that he could make that judgment for himself.

    The problem with your benchmark is that, as well as incorporating the testing overhead described above, you imposed the generation of a new hash every time. This wouldn't be necessary, as the hash would always exist; in fact, it would be the only place the lookup table was stored. If you re-read my post you'll see that I say "%hash1 in place of @array1". In a second version you then also imposed the generation of an array from the keys of the hash. Why? This would never be used, so there is no need to generate it.

    Given a large list of values to be compared against and an array of things to look up, the other methods require two steps to complete the process (sketched after the list below):

    1. Build a hash from the array of things to look up.
    2. Filter the list against the hash.
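
    Something like this, with illustrative names:

        # Step 1: build a temporary hash from the values to look up.
        my %reject;
        $reject{$_} = 1 for @exclude;

        # Step 2: filter the big list against that hash.
        my @clean = grep { !$reject{$_} } @biglist;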

    The difference is exemplified by the case where a second array of values needs to be excluded from the big list. Using the hash&grep method, this second array needs to be hashed; using delete, the array can be used directly. The overhead of building the temporary hash is an inherent part of the hash and grep algorithm. Maintaining the lookup table in a hash instead of an array allows delete to be used on a hash slice, with the added bonus that constructing the temporary hash becomes unnecessary.
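
    Roughly, again with illustrative names, the contrast for that second exclusion array looks like this:

        # hash & grep: the second array must be hashed first...
        my %reject2;
        $reject2{$_} = 1 for @exclude2;
        @clean = grep { !$reject2{$_} } @clean;

        # ...whereas with the lookup table held in a hash,
        # the second array is used directly.
        delete @lookup{ @exclude2 };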

    Of course, this method will not work if other parts of the program need to access the individual values of either the pre- or post-removal lists in their original order. As was discussed elsewhere recently, the need to access arrays by index is actually a fairly rare occurrence in Perl, and it matters even less here, as the indexes would have changed after the deletion anyway (although the relative ordering would remain).

    Using Benchmark::cmpthese to compare the two algorithms (I dropped the FAQ version as we both agree that it is wildly inefficient, serving only as an aid to understanding), and having isolated the code performing the intersection from the code only used to facilitate testing, I arrived at these timings:

                    Rate grepmethod deletemethod
    grepmethod    1.99/s         --         -99%
    deletemethod   270/s     13432%           --

    The full output is shown at the bottom of the test program below, and you will notice that the testing of the delete method is reported as not having been run enough times for accurate measurement. The method is so efficient that it is impossible on my box with 256MB of RAM to create enough test data to let it run long enough to prevent this warning. Whilst this may allow the 13000% better-performance figure to be challenged, the fact that the method is at least two orders of magnitude more efficient cannot be. I've now benchmarked this four different ways and the results are always the same: at least 10,000%, or two orders of magnitude, quicker.

    Code and full benchmark results.


    Nah! You're thinking of Simon Templar, originally played by Roger Moore and later by Ian Ogilvy.