in reply to performance issues with grepping large sorted arrays

If array A is sorted using the string cmp operator, you can use the binary search algorithm to determine whether $item is in the array.

The basic idea is to compare the item against the middle item in the array: if it's greater, binary search the top half of the array; if it's less, binary search the bottom half. You'll eventually narrow it down to a one-element array, or hit the item along the way at a midpoint. The runtime of this algorithm is O(lg n), versus the O(n) of using grep.

A recursive implementation follows:

use POSIX qw(floor);
use Carp;

# Takes an array ref and a target item.
# Returns the index of the item in the array, or -1 if it's not there.
sub binsearch {
    my $arref  = shift;
    my $target = shift;

    my $len = scalar @$arref;
    return -1 unless $len;                      # empty (sub)array: not here

    my $midpoint = floor( $len / 2 );           # get the center, rounding down so we don't get undef when one element is left
    my $cmp = $target cmp $arref->[$midpoint];  # make the comparison once

    return $midpoint unless $cmp;               # we've found it!
    return -1 if $len == 1;                     # must not be there

    # It's in the lower half; check up to (but not including) the midpoint,
    # since the midpoint itself has already been compared.
    return binsearch( [ @{$arref}[ 0 .. $midpoint - 1 ] ], $target )
        if $cmp == -1;

    # It's in the upper half; start at midpoint+1 for the same reason, and
    # translate the returned index back into the full array's coordinates.
    if ( $cmp == 1 ) {
        my $idx = binsearch( [ @{$arref}[ $midpoint + 1 .. $len - 1 ] ], $target );
        return $idx == -1 ? -1 : $idx + $midpoint + 1;
    }

    croak "Something really weird happened. Might want to run make test on perl itself.";
}
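
For example (the word list is just for illustration):

my @words = sort qw( delta alpha charlie bravo );   # ( alpha bravo charlie delta )
print binsearch( \@words, 'charlie' ), "\n";        # prints 2
print binsearch( \@words, 'echo' ), "\n";           # prints -1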

This will still take time with 800k elements. If speed is really an issue, consider PDL or doing it in C.

Re^2: performance issues with grepping large sorted arrays
by ikegami (Patriarch) on Jul 24, 2007 at 02:08 UTC

    Here's a version that's better in three ways:

    • It's not recursive, making it much more efficient.
    • Any sorting function can be used, not just cmp.
    • The returned value is much more useful.

    Code:
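
    A minimal iterative sketch matching those three points (the comparator interface and the negative "insertion point" return value are illustrative assumptions, not necessarily the original code):

        # Iterative binary search. $cmp is called with one element at a
        # time and should compare the target against it (cmp/<=> style).
        # Returns the index of a match; otherwise returns the insertion
        # point encoded as -(index + 1), so the caller learns where the
        # target would belong.
        sub binsearch {
            my ( $cmp, $array ) = @_;
            my ( $lo, $hi ) = ( 0, $#$array );
            while ( $lo <= $hi ) {
                my $mid    = int( ( $lo + $hi ) / 2 );
                my $result = $cmp->( $array->[$mid] );
                if    ( $result < 0 ) { $hi = $mid - 1 }   # target sorts before $mid
                elsif ( $result > 0 ) { $lo = $mid + 1 }   # target sorts after $mid
                else                  { return $mid }      # found it
            }
            return -( $lo + 1 );   # not found: encode where it would go
        }

        # Any comparison can be supplied, not just cmp:
        my @nums = ( 1, 3, 5, 9 );
        my $idx = binsearch( sub { 5 <=> $_[0] }, \@nums );   # 2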

      The problem is still that the OP said they sorted the array just to try to make the grep faster.

      Yes, binary search is faster than a straight linear search.

      No, it's not going to be as fast as eliminating 1000 comparisons altogether.

      for ( @array ) {
          if ( exists $hash{$_} ) {
              # do something
          }
          else {
              # do something else
          }
      }
      Here, you have one hash lookup for each array item.
      @array = sort @array;

      foreach ( keys %hash ) {
          if ( is_in_array( $_ ) ) {
              # do something
          }
          else {
              # do something else
          }
      }
      Here, you're doing a sort (a quicksort is O(n log n) on average, with an O(n**2) worst case), then using a binary search, which is O(log n), for each key of the hash.

      So your solution ends up being O(n log n + m log n), with a worst case of O(n**2 + m log n). Just going through the loop and doing a hash lookup is predictably O(n).

      In simplest terms, with an 800k-element array you can do 800k hash lookups the way GrandFather and I have been recommending. To do a quicksort and then a binary search for each key of the hash, you're possibly talking about 800k * 800k + 1000 * log2(800k) comparisons. log2(800,000) is roughly 20, so that's about 640,000,020,000 comparisons in the worst case, even using fairly good sort and search algorithms.
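
      A quick back-of-the-envelope script, using the figures from the paragraph above:

          my $n = 800_000;                            # array elements
          my $m = 1_000;                              # hash keys
          my $sort_worst = $n * $n;                   # quadratic-sort worst case
          my $searches   = $m * ( log($n) / log(2) ); # ~20 comparisons per binary search
          printf "sort + search worst case: %.0f comparisons\n", $sort_worst + $searches;
          printf "hash-lookup loop:         %d lookups\n", $n;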

      The loop is simply inside out: iterate over the 800k-element array and do a hash lookup for each element, instead of iterating over the hash keys and searching the array for each one.