paragkalra has asked for the wisdom of the Perl Monks concerning the following question:

Hello All,

I have 2 arrays. I have written a script that searches each and every value of first array in the second array. Each & every value of Source Array should occur only once in the target array but it can occur anywhere.

If any of the source value doesn't occur or occur multiple times then the script reports to the user.

This script is working fine for small arrays.

But I tried executing the script for 2 arrays each of which containing 7,00,0000 elements and it is taking very long time.

I am using nested 'for' loops to make this search. How can I optimize it further to search source array in target array.

Replies are listed 'Best First'.
Re: Searching first array in second array.
by Ratazong (Monsignor) on Jan 19, 2010 at 12:05 UTC

    Hi!

    You write that you use nested for. That is a very inefficient way to do it. Suppose n is the size of the array, you would need O(n^2) compares. In your case it is ~7000000*7000000, which is very much.

    The following algorithm would be much faster O(n) (or ~7000000*3):

    1. create a temporary hash
    2. add all elements of the first array to that hash; use the array-value as a key and 1 as a value
    3. pass through the second array; if the current value is already in the hash, increase its value by 2
    4. pass through the hash; if all values are 3, array1 is part of array2

    Another idea could be to sort both arrays (the perl sort-function is more efficient than a "nested-for-(bubble-)sort") and then work on the sorted arrays.

    HTH, Rata

    PS.: please be aware that the algorithm above does not work if the first array contains some values more than once!
    PPS.: and of course you can improve the algorithm by adding checks on "more than once" in step 3 - and aborting the rest of the loops then

      As soon as you create "buckets" instead of single hashes, the algorithm also works for multiple elements per key:

      my @left = qw(1 2 3 1 1); my %left; for my $l (@left) { my $key = $l; # maybe you want some more complicated key than the +item itself $left{ $key } ||= []; push @{ $left{ $key } }, $l; }; for my $r (@right) { my $key = $r; # maybe you want some more complicated key than the +item itself if (@{ $left{ $key }}) { my $found = shift @{ $left{ $key }}; print "For $key, I found the pair ($found, $r).\n"; } elsif (exists $left{ $key }) { print "For $key, there is at one item on the right side left o +ver: $r.\n"; } else { print "For $key, there is no item on the left side at all: $r. +\n"; }; }; # Now lets see if we have any values on the left side that did not pai +r: for my $l (keys %left) { my @leftover = @{ $left{ $l } }; for (@leftover) { print "For $l, there was no corresponding element on the right + side ($_)\n"; }; };

      I haven't checked that code, but the approach is one I've used lots and lots when reconciling lists :)

Re: Searching first array in second array.
by BrowserUk (Patriarch) on Jan 19, 2010 at 12:41 UTC

    The only thing I'll add to the "use a hash" advice, is don't read the data into two arrays first. You'll likely to run out of memory because of the duplication.

    Far better to build a hash directly from the smaller set as you read it in. And the read the second set one-by-one and test it against the hash. For datasets of this size, that likely to save you a large amount of memory, and quite a bit of time.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Searching first array in second array.
by rkrieger (Friar) on Jan 19, 2010 at 12:40 UTC
    The thing below (with arbitrary, fabricated arrays) takes approximately 1m15s on my old 600 MHz machine. How does it compare to your running time? If I'm not mistaken, this scans the first and second arrays once, running a count on every valid thing it finds. Anyone with suggestions on how to do better, feel free to comment.
    #!/usr/bin/perl use strict; use warnings; my (@array1, @array2); my %seen; # Fabricate some array use List::Util qw(shuffle); @array2 = shuffle (0..1000000); @array1 = @array2[0..950000]; # Scan our first array foreach my $key (@array1) { $seen{$key} = 0; } # Search our second array foreach my $key (@array2) { # Check if valid key Yes, count it No, warn defined($seen{$key}) ? $seen{$key}++ : warn qq{Key '$key' f +ound, but not searching for it.\n}; } # Determine the item count foreach my $item (keys %seen) { my $count = $seen{$item}; if ($count == 1) { # Only once, as intended } elsif ($count == 0) { # Missing, report print qq{Item '$item' not found in search array.\n}; } else { # Multiple items, report print qq{Item '$item' found multiple ($count) times in + search array.\n}; } }
Re: Searching first array in second array.
by JavaFan (Canon) on Jan 19, 2010 at 12:32 UTC
    Assuming you want a failure if an element from array1 occurs twice in array2, and array1 and array2 don't contain references, I'd do:
    my @array1 = ...; my @array2 = ...; my %hash = map {$_ => 1} @array1; foreach my $e (@array2) { next unless exists $hash{$e}; die "Failure" if --$hash{$e} < 0; } die "Failure" if grep $_, values %hash; print "Success";
    If it's ok an element from array1 may occur more than once in array2, I'd write:
    my @array1 = ...; my @array2 = ...; my %hash; @hash{@array1} = (); delete @hash{@array2}; die "Failure" if %hash; print "Success";
Re: Searching first array in second array.
by cdarke (Prior) on Jan 19, 2010 at 12:05 UTC
    We would need to see your code to say how it could be improved. How are you detecting duplicates? Hopefully you are using a hash.

      I concur with cdarke. Anytime I need to check for duplicates I always think "hash". That's not the only good use of hashes, of course. But it always seems that it is one of the areas that hashes really shine...especially for simplicity and optimized speed when you don't want to spend a lot of time trying to eek out every last microsecond of performance.

      ack Albuquerque, NM
Re: Searching first array in second array.
by paragkalra (Scribe) on Jan 19, 2010 at 19:23 UTC

    @ rkrieger - You are a Genius. The code worked liked a lightning. It was able to process 4,00,000 elements in 2 seconds...Thats indeed Almighty

    Thanks a bunch once again.

    Sorting algorithm couldn't help much as it took more than 20 minutes and I had to finally exit the script execution.

    Perl and Perl monks are indeed making my sorting concepts strong. :)