Re: performance issues with grepping large sorted arrays
by GrandFather (Saint) on Jul 23, 2007 at 23:07 UTC
my %lu = map { $_ => 1 } @array;
foreach my $key (keys %H) {
    if ($lu{$key}) {
        # process the hit
    }
}
or turn the loop inside out:
foreach my $elt (@A) {
    if (exists $hash{$elt}) {
        # process the hit
    }
}
In the second case you may need to keep a record of hits so that you don't perform your hit process more than once (assuming that to be what you want).
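For instance (just one way to do it, not code from the node above), a %seen hash can record which elements have already been handled, using the same @A and %hash as in the snippet above:

my %seen;
foreach my $elt (@A) {
    next unless exists $hash{$elt};
    next if $seen{$elt}++;    # already processed this hit
    # process the hit exactly once
}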
Note that sorting the array will make no difference at all. grep always processes all elements. You could make a small efficiency gain by using a for loop instead of the grep and exiting as soon as a matching element is found.
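For example (an illustration, not part of GrandFather's node), List::Util's first does exactly that short-circuit scan, assuming the thread's @A and $key:

use List::Util qw(first);

my $hit = first { $_ eq $key } @A;    # stops scanning at the first match
if (defined $hit) {
    # process the hit
}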
DWIM is Perl's answer to Gödel
Re: performance issues with grepping large sorted arrays
by mr_mischief (Monsignor) on Jul 24, 2007 at 01:16 UTC
You're using a data structure (the hash) that takes constant time to access a single element, yet you're walking through all of its keys and, for each one, searching a second data structure (the array) that takes time linear in the number of elements it holds, just to see whether the two have an element in common. As GrandFather suggests, this is backwards.
You can, as GrandFather noted, turn this inside out:
foreach ( @array ) {
    if ( exists $hash{$_} ) {
        # do something
    } else {
        # do something else
    }
}
You could also, if you don't have to build both structures (which we cannot tell without more context), do something more like this:
my %hash = ( ... );

while ( $_ = get_item() ) {
    if ( exists $hash{ $_ } ) {    ### update: fixed syntax error typo
        # do something
    } else {
        # do something else
    }
}
... which would keep you from ever building the huge array (where get_item() returns one value that would have been assigned to an element of @array in the other version). As I said, though, we can't tell if you need the array without more information.
The former solution assumes you needed to have the array anyway, while the second hopes to avoid it if possible. If there's nothing you can do about having such a large array, you'll at least want to do the lookups using the hash.
In your original solution, the running time is the product of the number of keys in the hash and the number of elements in the array: think O(m * n). You could instead scan linearly through the array and do a constant-time hash lookup for each element, i.e. O(n + c*n), which is trivially O(n); that is what GrandFather's suggestion gets you. The necessary time is cut roughly by a factor of the number of keys in the hash, which is a big amount when you're talking about 1000 keys. 1/1000 of the time is always a good speedup.
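If you want to see that difference on your own machine, here's a rough sketch (not from the original post; the sizes are made up and far smaller than 800k so the benchmark finishes quickly) using the core Benchmark module:

use strict;
use warnings;
use Benchmark qw(cmpthese);

# Stand-in data, much smaller than the real 800k-element array.
my @array = map { "item$_" } 1 .. 20_000;
my %hash  = map { ( "item" . ( $_ * 40 ) => 1 ) } 1 .. 500;

cmpthese( -2, {
    grep_per_key => sub {                 # original approach: scan the array for every key
        my $hits = 0;
        for my $key ( keys %hash ) {
            $hits++ if grep { $_ eq $key } @array;
        }
    },
    hash_per_elt => sub {                 # inverted approach: one hash lookup per element
        my $hits = 0;
        for ( @array ) {
            $hits++ if exists $hash{$_};
        }
    },
} );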
The solution that would let you not use the array (if possible) would likely be a bit faster still because it wouldn't require the memory for the array. An array with 800k elements is likely to get into a page file somewhere if the elements are of any size at all. So I guess the question here is, where is the array from, and how is it built up?
Re: performance issues with grepping large sorted arrays
by wind (Priest) on Jul 23, 2007 at 23:00 UTC
Re: performance issues with grepping large sorted arrays
by almut (Canon) on Jul 23, 2007 at 23:16 UTC
"Is there anyway I can make this run faster?"
Don't use a data structure that requires a linear search. Use a hash instead, or, if that would get too large to fit into memory, use an indexed database, e.g. Berkeley DB (you might want to take a look at this recent thread, where a similar problem had been discussed).
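For the Berkeley DB route, a minimal sketch (my illustration, not almut's code; the file name lookup.db and the stand-in data are made up) using the core DB_File module might look like this:

use strict;
use warnings;
use DB_File;
use Fcntl;

my @A   = ('apple', 'banana', 'cherry');    # stand-in for the real 800k-element list
my $key = 'banana';

# Tie a hash to an on-disk B-tree; lookups then work like a normal hash,
# but the data lives on disk rather than in RAM.
tie my %lookup, 'DB_File', 'lookup.db', O_RDWR | O_CREAT, 0666, $DB_BTREE
    or die "Cannot tie lookup.db: $!";

$lookup{$_} = 1 for @A;                         # build the index once

print "found $key\n" if exists $lookup{$key};   # no linear scan needed

untie %lookup;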
If you really need to use an array, search it in a short-circuit way, i.e. don't use grep, but rather a for loop in combination with last, to exit the loop as soon as you've found a match:
my $found = 0;
for (@A) { if ($_ eq $key) { $found = 1; last } }
if ($found) {
...
Arrays don't require a linear search; see my post about binary search. It requires a sorted array, but if you've already got that...
In terms of algorithmic complexity, short-circuiting has no impact on efficiency: the worst case is still O(n) and the average case is O(n/2), which simplifies to O(n) and is still greater than O(lg n). (For an 800k-element array that's roughly 400,000 comparisons on average, versus about 20 for a binary search.)
I absolutely agree that a binary search would be faster if you have a sorted array (though sorting the array in the first place would also take quite some time, and I was under the impression that the OP would only be doing that in an attempt to speed things up).

As to "the worst case is still O(n) and the average case is O(n/2) which simplifies to O(n)": I think in most real-life situations you're very unlikely to encounter the worst-case scenario (in the OP's case, it would mean that every $key being looked up either matches the last element of the array or is not found at all). Sure, ultimately it's a statistical issue, but in the typical case there will be some improvement when short-circuiting the search. But that's a moot point anyway, as there are even better alternatives... (like hash tables, with constant-time lookup on average)
Re: performance issues with grepping large sorted arrays
by Trizor (Pilgrim) on Jul 23, 2007 at 23:17 UTC
If array @A is sorted using the string cmp operator, you can employ the binary search algorithm to determine whether $item is in the array.
The basic idea is to compare the item with the middle element of the array: if it's greater, binary search the top half of the array; if it's less, binary search the bottom half. You'll eventually narrow it down to a one-element array, or hit the item along the way at a midpoint. The runtime of this algorithm is O(lg n), versus the O(n) of using grep.
A recursive implementation follows:
use POSIX qw(floor);

# Takes an array ref (sorted with cmp) and a target item.
# Returns the index of the item in the array, or
# -1 if it's not there.
sub binsearch {
    my $arref  = shift;
    my $target = shift;

    my $len = scalar @$arref;
    return -1 if $len == 0;                        # nothing left to search

    my $midpoint = floor( $len / 2 );              # get the center, rounding down
    my $cmp = $target cmp $arref->[$midpoint];     # make the comparison once

    return $midpoint unless $cmp;                  # we've found it!

    # It's in the lower half: indices below the midpoint keep their positions.
    return binsearch( [ @$arref[ 0 .. $midpoint - 1 ] ], $target ) if $cmp == -1;

    # It's in the upper half: adjust the index returned for the sub-array
    # back to a position in the original array.
    my $sub = binsearch( [ @$arref[ $midpoint + 1 .. $len - 1 ] ], $target );
    return $sub == -1 ? -1 : $sub + $midpoint + 1;
}
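For what it's worth, an iterative variant (a sketch of mine, not Trizor's code) avoids copying an array slice on every call and returns the index into the original array directly:

# Iterative binary search over a cmp-sorted array.
# Returns the index of $target in @$arref, or -1 if it isn't there.
sub binsearch_iter {
    my ( $arref, $target ) = @_;
    my ( $lo, $hi ) = ( 0, $#$arref );

    while ( $lo <= $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        my $cmp = $target cmp $arref->[$mid];
        return $mid if $cmp == 0;          # found it
        if ( $cmp < 0 ) { $hi = $mid - 1 } # look in the lower half
        else            { $lo = $mid + 1 } # look in the upper half
    }
    return -1;                             # not present
}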
This will still take time with 800k elements. If speed is really an issue, consider PDL or doing it in C.
The problem still is that the OP said they sorted the array just to try to get the grep to be faster.
Yes, binary search is faster than a straight linear search.
No, it's not going to be as fast as eliminating 1000 comparisons altogether.
for ( @array ) {
    if ( exists $hash{$_} ) {
        # do something
    } else {
        # do something else
    }
}
Here, you have one hash lookup for each array item.
@array = sort @array;

foreach ( keys %hash ) {
    if ( is_in_array( $_ ) ) {
        # do something
    } else {
        # do something else
    }
}
Here, you're doing a sort (quicksort, if using Perl's sort(), which is O(n log(n)) and worst case O(n**2) ), then using binary search which is O(log2 n) for each key of the hash.
So your solution ends up being O( (n log(n)) + (m * log n) ) with a worst case of O(n**2 + m * log(n)). Just going through the loop and doing a hash lookup is predictably O(n).
In simplest terms, with an 800k element array you can do 800k comparisons the way GrandFather and I have been recommending. In order to do a quicksort then a binary search for each key of the hash, you're possibly talking about 800k * 800k + 1000 * log2(800k). log2(800,000) is roughly 20. So 640,000,020,000 comparisons in the worst case, even using fairly good sort and search algorithms.
The loop is simply inside out.