Re^2: performance issues with grepping large sorted arrays

Here's a version that's better in three ways:

It's not recursive, making it much more efficient.
Any sorting function can be used, not just cmp.
The returned value is much more useful.

Code:

sub _unsigned_to_signed {
   return unpack('i', pack('I', $_[0]));
}

sub binsearch(&$\@) {
   my ($compare, $value, $array) = @_;

   #
   # If the $value is not found in @array,
   # the 1s complement of where the $value
   # should be inserted is returned. So,
   # $value is found if the return value >= 0.
   #
   # When compare is called, $a is an alias for $value,
   # and $b is an alias for the element of the array
   # against which $a which be compared.
   #
   # Example:
   #    # Check if $value is in sorted @array.
   #    if ((binsearch { $a <=> $b } $value, @array) >= 0) {
   #       print("present\n");
   #    } else {
   #       print("absent\n");
   #    }
   #
   # Example:
   #    # Add $value to sorted @array, if it's not already there.
   #    $idx = binsearch { $a <=> $b } $value, @array;
   #    splice(@array, ~$idx, 0, $value) if ($idx < 0);
   #

   my $i = 0;
   my $j = $#$array;

   return $j if ($j == -1);

   my $ap = do { no strict 'refs'; \*{caller().'::a'} };
   my $bp = do { no strict 'refs'; \*{caller().'::b'} };

   local *$ap;
   local *$bp;

   *$ap = \$value;

   for (;;) {
      my $k = int(($j-$i)/2) + $i;
      *$bp = \($array->[$k]);

      my $cmp = &$compare();
      return $k if ($cmp == 0);

      if ($cmp < 0) {
         $j = $k-1;
         return _unsigned_to_signed(~$k) if ($i > $j);
      } else {
         $i = $k+1;
         return _unsigned_to_signed(~$i) if ($i > $j);
      }      
   }
}
[download]

Comment on Re^2: performance issues with grepping large sorted arrays Select or Download Code

Replies are listed 'Best First'.
Re^3: performance issues with grepping large sorted arrays by mr_mischief (Monsignor) on Jul 24, 2007 at 16:34 UTC
The problem still is that the OP said they sorted the array just to try to get the grep to be faster. Yes, binary search is faster than a straight linear search. No, it's not going to be as fast as eliminating 1000 comparisons altogether. `for ( @array ) { if ( exists $hash{$_} ) { # do something } else { # do something else } }` [download] Here, you have one hash lookup for each array item. `@array = sort @array; foreach ( keys %hash ) { if ( is_in_array( $_ ) ) { # do something } else { # do something else } }` [download] Here, you're doing a sort (quicksort, if using Perl's sort(), which is O(n log(n)) and worst case O(n*2) ), then using binary search which is O(log2 n) for each key of the hash. So your solution ends up being O( (n log(n)) + (m log n) ) with a worst case of O(n*2 + m log(n)). Just going through the loop and doing a hash lookup is predictably O(n). In simplest terms, with an 800k element array you can do 800k comparisons the way GrandFather and I have been recommending. In order to do a quicksort then a binary search for each key of the hash, you're possibly talking about 800k * 800k + 1000 * log2(800k). log2(800,000) is roughly 20. So 640,000,020,000 comparisons in the worst case, even using fairly good sort and search algorithms. The loop is simply inside out.	[reply] [d/l] [select]

Replies are listed 'Best First'.

Re^3: performance issues with grepping large sorted arrays
by mr_mischief (Monsignor) on Jul 24, 2007 at 16:34 UTC

for ( @array ) {
    if ( exists $hash{$_} ) {
        # do something
    } else {
        # do something else
    }
}
[download]

@array = sort @array;
foreach ( keys %hash ) {
    if ( is_in_array( $_ ) ) {
       # do something
    } else {
       # do something else
    }
}
[download]

GrandFather

[reply]
[d/l]
[select]