Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am employing a binary search over a fairly large (~1-10Million integers) sorted array. My perl code, which is given below,

(1) returns the index of the array if it finds a match
(2a) if it does NOT find a match and begin=1, returns the index of the array where the "search query must belong"
(2b) if it does NOT find a match and begin=0, returns the index of the array where the "search query must belong" minus 1

I call this subroutine ~400 times, and it takes about 7 minutes. I have no clue why it is taking so long. My colleague who has this subroutine written in python runs is no time. Is Perl's long running time due to the way Perl handles arrays?

Can you please suggest what is wrong with the code, and provide tips to speed-up? Thanks a lot in advance
sub binarySearch ### sample call: $returnValue = binarySearch($begin +, $query, \@ArrayPtr); { my ($begin, $query, $ArrayPtr); ### begin = 1 or 0 ($begin, $query, $ArrayPtr) = @_; my(@array) = @$ArrayPtr; ### Handling cases when query is out of bounds return 0 if ($query < $array[0] && $begin == 1); return $#array if ($query > $array[$#array] && $begin == 0); my ($left) = 0; my ($right) = $#array; my ($center); my ($prevCenter) = -1; while(1) { $center = int (($left + $right)/2 ); if ($query == $array[$center]) { if ($begin == 1) { while($center > 0 && $array[$center] == $array[$center-1]) { $center = $center - 1; } } else { while($center < $#array && $array[$center] == $array[$center+1 +]) { $center = $center + 1; } } undef @array; return $center; } if ($center == $prevCenter) { undef @array; return $right if ($begin == 1); return $right-1 if ($begin == 0); } $right = $center if ($query < $array[$center]); $left = $center if ($query > $array[$center]); $prevCenter = $center; } }

Replies are listed 'Best First'.
Re: My Binary Search Perl code takes forever
by blokhead (Monsignor) on Dec 15, 2007 at 15:49 UTC
    On my computer, it takes about 2 seconds to run on an array with 1 million ints:
    my @arr = (1 .. 1_000_000); my $x = 425983; printf "search for $x? found at position: %s\n", binarySearch(0,$x,\@a +rr);
    However, if I go to 5 million, my operating system starts thrashing. Maybe that's what's happening to you too (memory usage starts becoming an issue). It's also quite possible that python has lower overhead for integer scalars than Perl, but I have no idea.

    One thing I noticed is that you pass in the array by reference (good idea), but the first thing you do is copy it entirely into a local variable. Why not keep accessing it by reference, like this: (I changed some other things around too)

    sub binarySearch { my ($begin, $query, $array) = @_; return 0 if $query < $array->[0] and $begin == 1; return $#$array if $query > $array->[$#$array] and $begin == 0; my ($left,$right,$prevCenter, $center) = (0, $#$array, -1); while(1) { $center = int (($left + $right)/2); if ($query == $array->[$center]) { if ($begin == 1) { $center-- until $center == 0 or $array->[$center] != $array->[$center-1]; } else { $center++ until $center == $#$array or $array->[$center] != $array[$center+1]; } return $center; } if ($center == $prevCenter) { return $right if $begin == 1; return $right-1 if $begin == 0; } $right = $center if $query < $array->[$center]; $left = $center if $query > $array->[$center]; $prevCenter = $center; } }
    This should save a lot of memory. It now takes about 10 seconds running on 5 million ints for me.

    blokhead

      Did you try one query (2seconds) or ~500 queries. For about ~500 queries, it takes about 5-7 minuetes. I think this is way too slow. Because, I need to make about 100,000 queries, which takes 1000 minutes (16 hours)....
        "One thing I noticed is that you pass in the array by reference (good idea), but the first thing you do is copy it entirely into a local variable. Why not keep accessing it by reference, like this: (I changed some other things around too)"


        Wow! I keep accessing it by reference, it runs in no time. Fantastic. Thanks a lot.
Re: My Binary Search Perl code takes forever
by ikegami (Patriarch) on Dec 15, 2007 at 17:09 UTC

    blokhead's analysis sounds true. When dealing with millions of values, Perl uses up all available memory.

    Since you're only using integers, one method to reduce memory usage considerably is to use a string of packed integers. Tie::IntegerArray makes this transparent for you. (It's suppose to, at least. Untested.) Use it in conjunction with blokhead's change to use a reference.

    my @integer_array; tie @integer_array, 'Tie::IntegerArray', bits => 32, signed => 1, size => 1_000_000; # Pre-allocate this many

    By the way, why do you switch from a binary search to a linear search to find the beginning of a match?

      "By the way, why do you switch from a binary search to a linear search to find the beginning of a match?" Since the array could contain duplicate entries, I just want to find the first (if begin = 1) or last (if begin=0) of the entries. Is there a better way to do it that to search linearly after I find my query?

        Yes, stick with the binary search. Just change the terminating condition from

        $query == $array->[$center]

        to

        $query == $array->[$center] && ($begin ? ($center == 0 || $query != $array->[$center-1]) : ($center == $#$array || $query != $array->[$center+1]) )

        You check the condition a second and third time later down, so you need to fix those too.

        All together:

        sub binarySearch { my ($begin, $query, $array) = @_; return 0 if $query < $array->[0] and $begin == 1; return $#$array if $query > $array->[$#$array] and $begin == 0; my ($left, $right, $prevCenter) = (0, $#$array, -1); while(1) { my $center = int (($left + $right)/2); my $cmp = $query <=> $array->[$center] || ($begin ? ($center == 0 || $query != $array->[$center- +1] ? 0 : -1) : ($center == $#$array || $query != $array->[$center+ +1] ? 0 : +1) ); return $center if $cmp == 0; if ($center == $prevCenter) { return $right if $begin == 1; return $right-1 if $begin == 0; } $right = $center if $cmp < 0; $left = $center if $cmp > 0; $prevCenter = $center; } }

        Update: Added code.

Re: My Binary Search Perl code takes forever
by Limbic~Region (Chancellor) on Dec 15, 2007 at 19:29 UTC
    Anonymous Monk,
    I wrote the following code without having looked at yours just to see how long it took:
    #!/usr/bin/perl use strict; use warnings; my @array = 1 .. 3_141_592; for (1 .. 400) { my $idx = bin_search(\@array, int(rand 3_141_592) + 1); } sub bin_search { my ($list, $tgt) = @_; return 0 if $tgt < $list->[0]; return $#$list if $tgt > $list->[-1]; my ($beg, $end, $val) = (0, $#$list, undef); while ($beg <= $end) { my $mid = int(($beg + $end) / 2); $val = $list->[$mid]; if ($val > $tgt) { $end = $mid - 1; } elsif ($val < $tgt) { $beg = $mid + 1; } else { return $mid; } } return -1; }

    It takes less than 1.5 seconds to run. Additionally, it takes almost exactly the same time if I don't run the bin_search(). In other words, the 400 searches take a negligible amount of time and the majority of the 1.5 seconds is building the array.

    Now that I have looked at your code, I recognize it doesn't do everything you want. Modifying it shouldn't be difficult. Elsewhere in the thread you mention duplicates. If this array needs to be searched many times and will not change between searches, perhaps you should hash the index of the first position of each unique integer. Without knowing more about the problem - it is hard to offer optimization suggestions.

    Cheers - L~R

      Awesome!!Completed in 55 seconds compared to my earlier code which took 6 hours!!