in reply to performance issues with grepping large sorted arrays

You're using a data structure (the hash) that takes constant time to access a single element, yet you're iterating over all of its elements, and for each one scanning a second data structure (the array) that takes time linear in the number of elements it holds, just to see whether the two have an element in common. As GrandFather suggests, this is backwards.

You can, as GrandFather noted, turn this inside out:
    foreach ( @array ) {
        if ( exists $hash{$_} ) {
            # do something
        }
        else {
            # do something else
        }
    }
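If the roughly 1000 keys start out in a list of their own, the lookup hash is cheap to build up front. A minimal sketch, where @small_list is a made-up name standing in for that list:

    # A hash slice assigned an empty list creates all the keys at
    # once; exists() only cares about keys, so undef values are fine.
    my %hash;
    @hash{@small_list} = ();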
You could also, if you don't have to build both structures (which we cannot tell without more context), do something more like this:
    my %hash = ( ... );

    while ( $_ = get_item() ) {
        if ( exists $hash{ $_ } ) {    ### update: fixed syntax error typo
            # do something
        }
        else {
            # do something else
        }
    }
... which would keep you from ever building the huge array (where get_item() returns one value that would have been assigned to an element of @array in the other version). As I said, though, we can't tell if you need the array without more information.
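If, say, the 800k values come from a file, get_item() can hand back the next record as it's read instead of the whole file being loaded into @array first. A minimal sketch, with the file name and filehandle invented for illustration:

    open my $fh, '<', 'huge_list.txt' or die "Can't open huge_list.txt: $!";

    # Returns the next record from the file, or nothing at end of file.
    sub get_item {
        my $line = <$fh>;
        return unless defined $line;
        chomp $line;
        return $line;
    }

One caveat: the while ( $_ = get_item() ) condition above tests truthiness, so a record of "0" or an empty record would end the loop early; while ( defined( $_ = get_item() ) ) is the safer test.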

The first solution assumes you needed to build the array anyway, while the second tries to avoid building it at all. If there's nothing you can do about having such a large array, you'll at least want to do the lookups against the hash.

In your original solution, the running time is the product of two factors: one linear in the number of keys in the hash and one linear in the number of elements in the array; think O(m * n) for m keys and n elements. If instead you scan linearly through the array and do a constant-time hash lookup for each element, you get O(n) (strictly O(n + c*n) for some constant c, but that is trivially O(n)), which is what GrandFather's suggestion gets you. The running time is cut roughly by a factor of the number of keys in the hash, which is a big amount when you're talking about 1000 keys: 1/1000 of the time is always a good speedup.
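If you want to see the difference concretely, the core Benchmark module can compare the two approaches side by side. A rough sketch, with the list sizes scaled down and the data invented purely for illustration:

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my @array = map { "item$_" } 1 .. 10_000;          # stand-in for the big list
    my %hash  = map { ( "item$_" => 1 ) } 1 .. 1_000;  # stand-in for the 1000 keys

    cmpthese( -2, {
        # original approach: scan the whole array once per hash key
        grep_per_key => sub {
            my $found = 0;
            for my $key ( keys %hash ) {
                $found++ if grep { $_ eq $key } @array;
            }
        },
        # inverted approach: one pass over the array, O(1) hash lookups
        hash_lookup => sub {
            my $found = 0;
            for my $item ( @array ) {
                $found++ if exists $hash{$item};
            }
        },
    } );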

The solution that avoids the array entirely (if that's possible) would likely be a bit faster still, because it wouldn't require the memory for the array. An array of 800k elements is likely to spill into the page file if the elements are of any size at all. So the real question is: where does the array come from, and how is it built up?

Re^2: performance issues with grepping large sorted arrays
by HarshaHegde (Sexton) on Jul 24, 2007 at 21:02 UTC
    Dear Monks,

    Thanks to all for the replies. I learned a thing or two about search algorithms and algorithmic complexity from them. I figured the easiest and quickest fix was to not use arrays at all: I stored both lists in hashes and then just used 'exists' on each of them. Now the script completes in about 10 seconds.
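
    For anyone who finds this later, a minimal sketch of that approach, with the file-loading details invented for illustration:

        use strict;
        use warnings;

        # Load each list straight into a hash, one line at a time,
        # so no intermediate array is ever built. File names are
        # made up here.
        my %first;
        open my $fh1, '<', 'list1.txt' or die "Can't open list1.txt: $!";
        while ( my $line = <$fh1> ) {
            chomp $line;
            $first{$line} = 1;
        }
        close $fh1;

        my %second;
        open my $fh2, '<', 'list2.txt' or die "Can't open list2.txt: $!";
        while ( my $line = <$fh2> ) {
            chomp $line;
            $second{$line} = 1;
        }
        close $fh2;

        # Constant-time membership tests via exists.
        for my $key ( keys %first ) {
            if ( exists $second{$key} ) {
                # key appears in both lists
            }
        }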

    Thanks,