in reply to Most Significant Set Bit

A binary search should be the 3rd alternative here. First alternative, if the problem space is only 64 bits, is a linear search through 64 bits. Second alternative should probably be a mathematical solution. And the third would be a binary search. It's really hard to make the complexity of a binary search beat the speed of a linear search for a small search set.

It's true that a binary search in a 64 bit unsigned integer range is going to take, at worst, 64 comparisons. And on average, a lot less than 64 comparisons, in a randomized data set of 64 bit numbers. This is O(log(n)). But a linear search through a 64-bit vector to find the most significant bit for an integer, is also O(log(n)); you have at most 64 comparisons with a binary search through 2**64 numbers, or you have at most 64 comparisons if you do a linear search through the bits of a number that fits within 64 bits. And a linear search will be very fast for such a small problem space.

But a mathematical solution to the problem is that the most significant bit of any nonzero unsigned integer will be found at index int(log2(n)). So if n is 1, int(log2(n)) is 0 (the zero-index bit; the right-most bit). The int(log2(32768)) is 15. You cannot represent 32768 in fewer than 16 bits. And the most significant bit will be the left-most bit, or the bit at index (offset) 15.

Therefore, a solution that requires NO iteration at all could be:

sub most_significant_ix {
    my $n = shift;
    return if $n == 0;    # no bit is set in zero
    return int(log($n)/log(2));
}

Unfortunately the log function isn't inexpensive, so the linear search through 64 bits will probably still win, though on paper this solution is O(1), whereas the linear search through bits of an integer, and the binary search through the integer range, will both be O(log(n)) time complexity.
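One more caveat with the log approach is floating-point rounding: a 64-bit integer does not always fit exactly in a double, so the estimate can land one position off near powers of two. Here is a minimal sketch of a log estimate followed by an integer fix-up. The name msb_ix_checked is hypothetical, and the code assumes a perl built with 64-bit integers and a non-negative integer argument.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): log2 estimate plus integer fix-up.
# Assumes 64-bit integers and a non-negative integer argument.
sub msb_ix_checked {
    my $n = shift;
    return undef if $n == 0;

    my $ix = int(log($n) / log(2));   # floating-point estimate
    $ix = 63 if $ix > 63;             # values near 2**64 can round up to 64

    # Step down while the estimate points above the highest set bit.
    $ix-- while $ix > 0 && !($n >> $ix);
    # Step up while a higher bit is still set.
    $ix++ while $ix < 63 && ($n >> ($ix + 1));

    return $ix;
}

printf "%-22s %s\n", $_, msb_ix_checked($_) for 1, 32767, 32768, ~0;
```

The fix-up loops normally run zero or one time, so this still does almost no iteration, but it no longer relies on log() being exact.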

A linear search through the bits solution would look like this:

sub most_significant_ix {
    my $n = shift;
    return undef if $n == 0;
    my $bits = 64;
    while (--$bits >= 0) {
        return $bits if (2**$bits) & $n;
    }
}

Dave

Re^2: Most Significant Set Bit
by hippo (Archbishop) on Mar 15, 2024 at 23:07 UTC
    Unfortunately the log function isn't inexpensive

    True, but you can halve the expense for large numbers of evaluations by storing log(2), which is just a constant.


    🦛

      Strangely using a stored $log2 = log(2) in the above test did not improve the log function speed. It was actually slightly slower.

      EDIT: similar also with use constant log2 => log(2)

      EDIT2: The interpreter must compile such constant expressions into a constant before running.

        The interpreter must compile such constant expressions into a constant before running.

        Yes

        $ perl -MO=Concise,-exec -e'my $x = log(2);'
        1  <0> enter v
        2  <;> nextstate(main 1 -e:1) v:{
        3  <$> const[NV 0.693147180559945] s/FOLD
        4  <1> padsv_store[$x:1,2] vKS/LVINTRO
        5  <@> leave[1 ref] vKP/REFC
        -e syntax OK
Re^2: Most Significant Set Bit
by NERDVANA (Priest) on Mar 20, 2024 at 22:31 UTC
    Some inaccuracies:
    It's true that a binary search in a 64 bit unsigned integer range is going to take, at worst, 64 comparisons. ... This is O(log(n))

    The initial proposal is a binary search on the bits which (as stated in the original post) takes at most 6 comparisons, not 64. It would be O(log(log(n)))

    But a linear search through a 64-bit vector to find the most significant bit for an integer, is also O(log(n)); .... And a linear search will be very fast for such a small problem space.

    Yes, linear on the number of bits, which is a loop of 64 comparisons, vs. the 6 proposed by OP.
    So, slower.

    Therefore, a solution that requires NO iteration at all could be ... though on paper this solution is O(1)

    Hiding a loop inside a function doesn't make it not-a-loop. Since the conversation is about bits, you can't just assume it as a constant like when you're assuming a 64-bit hardware op. If this were an arbitrary precision number like Math::BigInt, 'log' is definitely not a constant operation.
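To make the arbitrary-precision case concrete, here is a small sketch using Math::BigInt. The helper name msb_ix_big is hypothetical. It leans on as_bin(), which has to produce every binary digit, so the work grows with the length of the number rather than staying constant.

```perl
use strict;
use warnings;
use Math::BigInt;

# Hypothetical helper: index of the highest set bit of a positive Math::BigInt.
sub msb_ix_big {
    my $x = shift;
    return undef if $x->is_zero || $x->is_neg;
    # as_bin() returns a string such as "0b101"; drop the "0b" prefix
    # and count the remaining digits.
    return length($x->as_bin) - 3;
}

my $big = Math::BigInt->new(2)->bpow(100);   # 2**100
print msb_ix_big($big), "\n";
```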

      Good point on the calculation of log not being O(1).

      How do you do a binary search on '0111110100111001011101010000111101000100111011000000000000000011'? I'm not quite sure what is meant by doing a binary search on the bits. What does the comparator look like? When I suggested that the binary search must be on the integer range, it was because I couldn't envision how a binary search would be applied to efficiently discover the first non-zero bit in a bit field directly. I could see it working fairly well on an integer range, though.


      Dave

        I'm not quite sure what is meant by doing a binary search on the bits

        I mean the exact thing that OP used as an example :-) My phrasing "binary search on the bits" might not be the best name for it; maybe "a log-based binary search"?

        Written in a generic manner, it might look like

        my ($min, $max, $mid)= (0, 63);
        while ($min < $max) {
            $mid= int(($min+$max+1)/2);    # round up so $min always advances
            if ($n < (1 << $mid)) { $max= $mid-1; }
            else { $min= $mid; }
        }
        but since we know the range is 64 bit, it can be unrolled as
        if ($n < 0x100000000) {
          if ($n < 0x10000) {
            if ($n < 0x100) {
              if ($n < 0x10) {
                if ($n < 8) {
                  if ($n < 4) {
                    return $n < 2? 1 : 2;
        and so on.
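As a runnable variant of the same idea, the six halving steps can be written as a loop over shift widths instead of nested ifs. This is only a sketch: the names msb_ix_binary and msb_ix_linear are hypothetical, it assumes 64-bit integers, and it returns the zero-based bit index used earlier in the thread.

```perl
use strict;
use warnings;

# Hypothetical sketch: six tests, each halving the range that can hold the top bit.
sub msb_ix_binary {
    my $n = shift;
    return undef if $n == 0;
    my $ix = 0;
    for my $shift (32, 16, 8, 4, 2, 1) {
        if ($n >> $shift) {      # is anything set above the low $shift bits?
            $n >>= $shift;       # keep only the upper part
            $ix += $shift;       # and remember how far we moved
        }
    }
    return $ix;
}

# Straightforward scan from bit 63 down, used only as a cross-check.
sub msb_ix_linear {
    my $n = shift;
    return undef if $n == 0;
    for (my $bit = 63; $bit >= 0; $bit--) {
        return $bit if $n & (1 << $bit);
    }
}

for my $n (1, 2, 3, 255, 256, 0x100000000, ~0) {
    printf "%-22s binary=%d linear=%d\n", $n, msb_ix_binary($n), msb_ix_linear($n);
}
```

The loop always performs six tests regardless of the input, whereas the scan performs between 1 and 64.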

        Now, I have to retract my earlier statement about analyzing log() in terms of generic-length bit strings, because this binary search does actually depend on the greater-than and less-than ops being constant-time. In a variable-length bit string, those would also be loops. Still, I think even for fixed-width 64-bit numbers the log() function is probably implemented as a loop, because it has to calculate out the full floating-point precision, so it should be at least as expensive as floating-point division, which is notoriously slower than the other floating-point operations.

Re^2: Most Significant Set Bit
by Danny (Chaplain) on Mar 15, 2024 at 23:03 UTC
    I timed those two functions along with the following:
    sub by_string { my $n = shift; return length(sprintf "%b", $n) - 1; }
    I tested with high bits using int rand(2**64), medium bits int rand(2**32) and lower bits int rand(2**16) for 1e6 iterations.
    log      linear   sprintf
    0.2917   0.4427   0.3370   avg_sig=62
    0.2786   2.8430   0.3056   avg_sig=30
    0.2826   4.1060   0.2986   avg_sig=14
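For anyone wanting to repeat a comparison like this, here is a rough sketch using the core Benchmark module. It is not the harness used for the numbers above: the sub names, the input counts, and the iteration count are all assumptions, and timings will vary by machine and perl build.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Re-implementations of the three candidates; the names are hypothetical.
sub by_log {
    my $n = shift;
    return undef if $n == 0;
    return int(log($n) / log(2));
}

sub by_linear {
    my $n = shift;
    return undef if $n == 0;
    for (my $bit = 63; $bit >= 0; $bit--) {
        return $bit if $n & (1 << $bit);
    }
}

sub by_string {
    my $n = shift;
    return length(sprintf "%b", $n) - 1;
}

# Assumed input ranges, mirroring the high / medium / low cases in the post.
for my $width (64, 32, 16) {
    # "|| 1" avoids a zero input, which by_log and by_linear reject.
    my @inputs = map { int(rand(2**$width)) || 1 } 1 .. 1000;
    print "--- inputs below 2**$width ---\n";
    cmpthese(100, {
        log     => sub { by_log($_)    for @inputs },
        linear  => sub { by_linear($_) for @inputs },
        sprintf => sub { by_string($_) for @inputs },
    });
}
```

Pre-generating the inputs keeps the cost of rand() out of the timed code, so the comparison reflects only the three functions.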