Ya has asked for the wisdom of the Perl Monks concerning the following question:
```perl
@arr1 = (1, 22, 52, 25, 3, 6, 8, 9, 10);
@arr2 = (25, 52, 8);
...
# the result I need: @arr1 should now be (1, 22, 3, 6, 9, 10)
```
Re: How to splice out multiple entries from an array
by hossman (Prior) on Oct 24, 2002 at 07:28 UTC
by demerphq (Chancellor) on Oct 24, 2002 at 13:02 UTC
If you read the code, and assume that @arr1 is large without many duplicates, and that @arr2 is small without many duplicates either, then it would appear that the above will iterate over the values of @arr1 twice. The above also has the problem that the resulting array will have no ordering relationship to the original one. I think a superior method is the one I posted in my orphaned reply, or rather a minor variant that doesn't use hash slices. That code walks @arr1 and @arr2 only once each. Note that this is one of the weaknesses of big-O notation: both of these solutions are O(N+M), but their worst cases are 2(N+M) and N+M respectively, so the latter has a much better worst case even though they are the same order of solution. BrowserUk posted a benchmark that I will modify to benchmark the various solutions, including the one above. Regards,
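(The orphaned reply itself is not reproduced in this thread, so the following is a minimal sketch of a single-pass variant along the lines described above: build a lookup hash from the small exclusion array, then filter the large array in one pass, preserving its order.)

```perl
my %exclude;
@exclude{@arr2} = ();   # one pass over @arr2 to build the lookup

# one pass over @arr1; the original ordering is preserved
@arr1 = grep { !exists $exclude{$_} } @arr1;
```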
--- demerphq
by demerphq (Chancellor) on Oct 24, 2002 at 08:57 UTC
:-)
--- demerphq
by Ya (Initiate) on Oct 24, 2002 at 11:31 UTC
Best Regards, Ya
Re: How to splice out multiple entries from an array
by BrowserUk (Patriarch) on Oct 24, 2002 at 12:33 UTC
If you need to generate the intersection between your array1 and another array more than once, then there is a way of speeding this process up by a huge factor (over 10000%!) relative to using the FAQ intersection method. Instead of generating a new hash from arrays 1 & 2 each time you need to do this, you can use a %hash1 in place of @array1. In use, you would need to change your array1 references from $array1[n] to $aryHash1{n}, you wouldn't be able to have duplicates in %aryHash1, and it would use additional memory, though less than the combined space of an array and a temporary hash for doing the intersection. Benchmark: Read more... (2 kB)
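(A minimal sketch of the technique described above; the benchmark itself is elided, and the variable names are illustrative.)

```perl
# keep the lookup table in a hash instead of an array
my %aryHash1;
@aryHash1{@arr1} = (1) x @arr1;   # the keys matter; values are placeholders

# removing the whole exclusion set is then a single hash-slice delete,
# performed internally rather than element by element in Perl code
delete @aryHash1{@arr2};
```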
By my calculation, this is 1037/8.81 * 100 = 11770% faster! Did I make a mistake? Cor! Like yer ring! ... HALO dammit! ... 'Ave it yer way! Hal-lo, Mister la-de-da. ... Like yer ring!
by demerphq (Chancellor) on Oct 24, 2002 at 16:05 UTC
Despite the difficulties, I gave it a crack (using two different ways to set up the hash-based solution) and this is what I came up with: Read more... (2 kB)
The _lret results are the ones that actually return a list, and the prehash vs hash distinction has to do with how the variables are initialized. The results, I think, are surprising. Incidentally, unless Benchmark::Timer is timing _long_ events it will lose accuracy, perhaps considerably: on most boxes the resolution is around 1/100th of a second. That is why Benchmark is written the way it is. This is one of those things that makes benchmarking different abstract solutions against each other very difficult. Ultimately the only real way to benchmark is to use a live solution and time it against a different design under the same environment.
--- demerphq
by BrowserUk (Patriarch) on Oct 27, 2002 at 15:28 UTC
> That's why you used Benchmark::Timer, right?

The reason I used Benchmark::Timer is that it allowed me to time the deletion process itself, rather than incorporating the overhead of rebuilding the data for each pass. This rebuilding is only required because what we are trying to measure is a destructive process that we repeat many times for the purposes of benchmarking. It would not be part of a real application, so it should not be allowed to influence the timing of the code under test.

You may consider that incorporating the rebuilding process into all the tests makes for a valid comparison, but it doesn't. When the code required ONLY for the purposes of testing takes orders of magnitude longer than the code you are actually trying to measure, it swamps the data you are interested in to the point where the information gleaned is useless. Given algorithm A taking 1.1 time units, algorithm B taking 0.5 time units, and a test setup overhead of 100 time units:

    ( (100 + 1.1) / (100 + 0.5) ) * 100 = 100.597%

This makes it look like algorithm B is 0.6% better than algorithm A. Now remove the testing overhead from the equation:

    ( 1.1 / 0.5 ) * 100 = 220%

Algorithm A is 120% slower than algorithm B. Or:

    ( 0.5 / 1.1 ) * 100 = 45%

Algorithm B takes less than half the time that algorithm A took.

It is very difficult to prevent this inclusion of artificial overhead in the timings using Benchmark::cmpthese. However, I have achieved it by using a fresh, preallocated lookup array or hash for each iteration of the benchmark, thereby removing the need to incorporate the rebuilding of the destructively modified data into the benchmark timings. Hopefully this test methodology will meet with your approval.

> Considering that all of the others had initialization overhead involved I decided that your approach needed it as well. I realize that this is not entirely fair, considering that your approach would not incur this overhead under some designs. However the same argument applies to the other solutions to a certain degree.

This is not just 'not entirely fair', it's wrong. As stated in the post with my original benchmark, if the application is only using the array as a lookup table, and doesn't need to access the array elements by index, then using a hash in place of the array (as opposed to as well as the array) facilitates the use of a hash slice and the delete function to do the intersection. This removes any overhead associated with building a temporary hash to achieve the intersection, and means that the entire process of iterating over the exclusion set can be performed in machine code rather than at application level. It also removes the need to regenerate the lookup table, as the delete function modifies the hash directly.

With both the other methods, the building of the temporary hash is fundamental to their function, as is iteration over the lookup table at application level. It is the removal of these costly factors, and performing the intersection atomically (from the point of view of the application) at machine-code level, that is responsible for the huge hike in efficiency. This is exactly what the OP was asking for. If the OP's application did not fit these requirements, then the technique would not be valid for his purpose; but as we have no knowledge of those requirements, I offered the benchmark so that he could make that judgment for himself.
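(A minimal sketch of timing only the destructive delete with Benchmark::Timer, leaving the rebuild outside the timed span. The data setup is illustrative, not BrowserUk's actual benchmark.)

```perl
use Benchmark::Timer;

my @arr1 = 1 .. 10_000;
my @arr2 = grep { $_ % 7 == 0 } @arr1;   # an illustrative exclusion set

my $t = Benchmark::Timer->new;
for (1 .. 100) {
    my %lookup;
    @lookup{@arr1} = (1) x @arr1;   # rebuild the destroyed data: NOT timed

    $t->start('delete');
    delete @lookup{@arr2};          # only the deletion falls inside the timer
    $t->stop('delete');
}
print $t->report;
```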
The problem with your benchmark is that, as well as incorporating the testing overhead described above, you imposed the generation of a new hash every time. This wouldn't be necessary, as the hash would always exist; in fact it would be the only place where the lookup table is stored. If you re-read my post you'll see that I say "%hash1 in place of @array1". In a second version you then also imposed the generation of an array from the keys of the hash. Why? This would never be used, so there is no need to generate it.

Given a large list of values to be compared against and an array of things to look up, using the other methods requires two steps to complete the process. The difference is exemplified by the case where a second array of values needs to be excluded from the big list: using the hash-and-grep method, this second array needs to be hashed first; using delete, the array can be used directly. The overhead of building the temporary hash is an inherent part of the hash-and-grep algorithm. By maintaining the lookup table in a hash instead of an array, delete can be used on a hash slice, with the added bonus that constructing the temporary hash becomes unnecessary.

Of course, this method will not work if other parts of the program need to access the individual values of either the pre- or post-removal lists in their original ordering. As was discussed elsewhere recently, the need to access arrays by index is actually a fairly rare occurrence in Perl. This is even more true here, as the indexes would have changed after the deletion anyway, although the relative ordering would remain.

Using Benchmark::cmpthese to compare the two algorithms (I dropped the FAQ version as we both agree that it is wildly inefficient, serving only as an aid to understanding), having isolated the code performing the intersection from code used only to facilitate testing, I arrived at these timings.
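(A minimal sketch of the isolation approach described above, using cmpthese from the standard Benchmark module with a pool of preallocated lookup hashes so that the timed code contains only the intersection work. Sizes and names are illustrative, not BrowserUk's actual test data.)

```perl
use Benchmark qw(cmpthese);

my @arr1 = 1 .. 200;
my @arr2 = grep { $_ % 7 == 0 } @arr1;
my $runs = 2000;

# one fresh lookup hash per timed iteration (plus a few spares), so the
# rebuilding of the destructively modified hash never happens inside
# the timed code
my @hashes = map { my %h; @h{@arr1} = (1) x @arr1; \%h } 1 .. $runs + 50;
my $i = 0;

cmpthese( $runs, {
    # the technique under test: one hash-slice delete, no temporary hash
    delete_slice => sub { delete @{ $hashes[ $i++ ] }{ @arr2 } },

    # hash & grep: building the temporary hash is part of the algorithm,
    # so it belongs inside the timed code
    hash_grep => sub {
        my %skip; @skip{@arr2} = ();
        my @kept = grep { !exists $skip{$_} } @arr1;
    },
} );
```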
The full output is shown at the bottom of the test program below, and you will notice that the testing of the delete method is reported as not having been run enough times for accurate measurement. The method is so efficient that it is impossible, on my box with 256MB of RAM, to create enough test data to let it run enough times to prevent this warning. Whilst this may allow the 13,000% better performance figure to be challenged, the fact that it is at least two orders of magnitude more efficient cannot be. I've now benchmarked this four different ways and the results are always the same: at least 10,000%, or two orders of magnitude, quicker. Code and full benchmark results: Read more... (4 kB)
Nah! You're thinking of Simon Templar, originally played by Roger Moore and later by Ian Ogilvy.
Re: How to splice out multiple entries from an array
by demerphq (Chancellor) on Oct 24, 2002 at 08:49 UTC
The running time of this is N+M, i.e. the number of elements in both arrays summed together. Here is the relevant FAQ (in fact it explains how to do array union, intersection, and difference; the above only does the difference): Read more... (1383 Bytes)
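(The FAQ text itself is elided above; this is a sketch of the well-known perlfaq4 recipe it refers to, not necessarily the exact code demerphq posted. As the FAQ notes, it assumes each element appears at most once in each array.)

```perl
# perlfaq4 recipe: union, intersection, and (symmetric) difference
my (@union, @intersection, @difference, %count);
$count{$_}++ for @array1, @array2;
for my $element (keys %count) {
    push @union, $element;
    push @{ $count{$element} > 1 ? \@intersection : \@difference }, $element;
}
```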
In future, please try doing some rudimentary searching before asking a question, especially for something simple like this. It has almost certainly been asked and answered time and again. HTH
--- demerphq
Re: How to splice out multiple entries from an array
by gjb (Vicar) on Oct 24, 2002 at 08:26 UTC
If the order of the list is not important, you can use Set::Scalar to do the job.
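(A minimal sketch of the Set::Scalar approach; gjb's actual code is not shown, so this is an assumption about its shape. Note that set element order is not preserved.)

```perl
use Set::Scalar;

my $s1 = Set::Scalar->new(@arr1);
my $s2 = Set::Scalar->new(@arr2);

my $diff = $s1 - $s2;        # overloaded '-' gives the set difference
@arr1 = $diff->elements;     # NB: element order is not preserved
```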
A solution for the truly lazy, hope this helps, -gjb-
Re: How to splice out multiple entries from an array
by hotshot (Prior) on Oct 24, 2002 at 07:16 UTC
Hotshot