in reply to performance issues with grepping large sorted arrays

If array A is sorted using the string cmp operator, you can use the binary search algorithm to determine whether $item is in the array.

The basic idea is to compare the item against the middle item in the array: if it's greater, binary search the top half of the array; if it's less, binary search the bottom half. You'll eventually narrow it down to a one-element array, or hit the item along the way at a midpoint. The runtime of this algorithm is O(lg n), versus the O(n) of using grep.

A recursive implementation follows:

use POSIX qw(floor);
use Carp;

# Takes an array ref and a target item.
# Returns the index of the item in the array, or -1 if it's not there.
sub binsearch {
    my $arref  = shift;
    my $target = shift;

    my $len = scalar @$arref;
    return -1 unless $len;                      # empty (sub)array: not here

    my $midpoint = floor( $len / 2 );           # get the center, rounding down so we don't get undef when one element is left
    my $cmp = $target cmp $arref->[$midpoint];  # make the comparison once

    return $midpoint unless $cmp;               # we've found it!
    return -1 if $len == 1;                     # must not be there

    # It's in the lower half; check up to (but not including) the midpoint,
    # since the midpoint itself has already been compared.
    return binsearch( [ @{$arref}[ 0 .. $midpoint - 1 ] ], $target )
        if $cmp == -1;

    # It's in the upper half; start at midpoint+1 for the same reason, and
    # translate the returned index back into the full array's coordinates.
    if ( $cmp == 1 ) {
        my $idx = binsearch( [ @{$arref}[ $midpoint + 1 .. $len - 1 ] ], $target );
        return $idx == -1 ? -1 : $idx + $midpoint + 1;
    }

    croak "Something really weird happened. Might want to run make test on perl itself.";
}
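
For example (the word list is just for illustration):

my @words = sort qw( delta alpha charlie bravo );   # ( alpha bravo charlie delta )
print binsearch( \@words, 'charlie' ), "\n";        # prints 2
print binsearch( \@words, 'echo' ), "\n";           # prints -1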

This will still take time with 800k elements. If speed is really an issue, consider PDL or doing it in C.

Re^2: performance issues with grepping large sorted arrays
by ikegami (Patriarch) on Jul 24, 2007 at 02:08 UTC

    Here's a version that's better in three ways:

    • It's not recursive, making it much more efficient.
    • Any sorting function can be used, not just cmp.
    • The returned value is much more useful.

    Code:
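
    A minimal iterative sketch matching those three points (the comparator interface and the negative "insertion point" return value are illustrative assumptions, not necessarily the original code):

        # Iterative binary search. $cmp is called with one element at a
        # time and should compare the target against it (cmp/<=> style).
        # Returns the index of a match; otherwise returns the insertion
        # point encoded as -(index + 1), so the caller learns where the
        # target would belong.
        sub binsearch {
            my ( $cmp, $array ) = @_;
            my ( $lo, $hi ) = ( 0, $#$array );
            while ( $lo <= $hi ) {
                my $mid    = int( ( $lo + $hi ) / 2 );
                my $result = $cmp->( $array->[$mid] );
                if    ( $result < 0 ) { $hi = $mid - 1 }   # target sorts before $mid
                elsif ( $result > 0 ) { $lo = $mid + 1 }   # target sorts after $mid
                else                  { return $mid }      # found it
            }
            return -( $lo + 1 );   # not found: encode where it would go
        }

        # Any comparison can be supplied, not just cmp:
        my @nums = ( 1, 3, 5, 9 );
        my $idx = binsearch( sub { 5 <=> $_[0] }, \@nums );   # 2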

      The problem is still that the OP said they sorted the array just to try to make the grep faster.

      Yes, binary search is faster than a straight linear search.

      No, it's not going to be as fast as eliminating 1000 comparisons altogether.

      for ( @array ) {
          if ( exists $hash{$_} ) {
              # do something
          }
          else {
              # do something else
          }
      }
      Here, you have one hash lookup for each array item.
      @array = sort @array;

      foreach ( keys %hash ) {
          if ( is_in_array( $_ ) ) {
              # do something
          }
          else {
              # do something else
          }
      }
      Here, you're doing a sort (a quicksort is O(n log n) on average, with an O(n**2) worst case), then using a binary search, which is O(log n), for each key of the hash.

      So your solution ends up being O(n log n + m log n), with a worst case of O(n**2 + m log n). Just going through the loop and doing a hash lookup is predictably O(n).

      In simplest terms, with an 800k-element array you can do 800k hash lookups the way GrandFather and I have been recommending. To do a quicksort and then a binary search for each key of the hash, you're possibly talking about 800k * 800k + 1000 * log2(800k) comparisons. log2(800,000) is roughly 20, so that's about 640,000,020,000 comparisons in the worst case, even using fairly good sort and search algorithms.
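
      A quick back-of-the-envelope script, using the figures from the paragraph above:

          my $n = 800_000;                            # array elements
          my $m = 1_000;                              # hash keys
          my $sort_worst = $n * $n;                   # quadratic-sort worst case
          my $searches   = $m * ( log($n) / log(2) ); # ~20 comparisons per binary search
          printf "sort + search worst case: %.0f comparisons\n", $sort_worst + $searches;
          printf "hash-lookup loop:         %d lookups\n", $n;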

      The loop is simply inside out: iterate over the 800k-element array and do a hash lookup for each element, instead of iterating over the hash keys and searching the array for each one.