in reply to Re^8: Need more precision. (Updated.)(And again)
in thread Need more precision.

It's been a really long time since I last worked with fixed point arithmetic, so I decided to first do a little refresher on that:

https://www.youtube.com/watch?v=S12qx1DwjVk https://www.youtube.com/watch?v=npQF28g6s_k https://www.youtube.com/watch?v=7pkXlcapNB4

... and now I've run out of time and will have to get back to playing with it later :-/ But the above essentially confirms what I was thinking earlier: convert your floats in the range -4 to 4 into 128-bit ints with a scale of something like 1/2**124; addition and subtraction remain trivial, multiplication either requires a 256-bit output, or you'll have to shift your two multiplicands first and lose some precision.

Your approach is certainly different from that fixed-point approach, but mostly due to lack of time in comprehending it I'm not yet sure that I see where you are going with it other than implementing your own 128-bit floating point numbers?

BTW, in my research I also came across Math::Decimal128 and Math::Float128, although I'm not sure if those will help you in MSVC...

Replies are listed 'Best First'.
Re^10: Need more precision. (Updated.)
by BrowserUk (Patriarch) on Jun 10, 2015 at 01:37 UTC
    Your approach is certainly different from that fixed-point approach, but mostly due to lack of time in comprehending it I'm not yet sure that I see where you are going with it other than implementing your own 128-bit floating point numbers?

    Um. I'm just playing with it at the moment. Looking for the path of least resistance that allows me to get back to what I really want to be doing.

    I reasoned that as I have a 64-bit scale (it costs no more in computation time than using a byte; and I only have 6 of these numbers, so the extra space isn't an issue), it was easier to allow the scale to float a little than getting into doing "proper" fixed point and the need for a 128-bit register.

    That maybe short-sighted, but I've never written anything like this before, so I don't know what limitations I'm going to encounter. A new world to explore :)

    Maybe your links will educate me against it.

    Lot's to watch and read. Many thanks for your help.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

      So I'm still short on time but here is a proof-of-concept for fixed-point arithmetic with 128-bit ints in Perl I threw together based on the video tutorials. The same logic used would apply to any other language as well.

      I've used Math::Int128 as my 128-bit data type here, but if the target language doesn't support 128-bit ints natively, it should be possible to find a library (or implement one oneself) that abstracts out addition, subtraction, multiplication, and shifts on a custom 128-bit integer type (e.g. 2x 64-bit ints).

      Some notes on the code: The Math::Big* modules are for conversion to/from fixed-point only; the idea being that you first bring all your numbers into the fixed-point domain and then do all your work there. The scale of 1/2**124 means there are 4 bits before the decimal point, which is enough for the stated range of -4 to 4. The multiplication method implemented in mul_fixed is the "fast but less accurate" one (shift first, then multiply), because the most accurate one requires a 256-bit output (multiply first, then shift). As explained in the video tutorial, if the multiplicands are in a known range, it's possible to play with the shifting so that less accuracy is lost.

      #!/usr/bin/env perl use strict; use warnings; use Math::BigFloat; use Math::BigInt; use Math::Int128 qw(int128 int128_to_hex); use Math::Int128::die_on_overflow; $\ = $/; # https://www.youtube.com/watch?v=S12qx1DwjVk my $SCALE_BITS = 124; my $SCALE = Math::BigInt->new(2)->bpow($SCALE_BITS); print " Scale: 2**$SCALE_BITS = $SCALE"; sub number2fixed { # Math::Big* --> fixed-pt Math::Int128 return int128( shift->copy->bmul($SCALE)->as_int->bstr ); } sub fixed2number { # fixed-pt Math::Int128 --> Math::BigFloat return scalar Math::BigFloat->new(''.shift)->bdiv($SCALE); } my $EPSILON = fixed2number(int128(1)); print "Epsilon: ", $EPSILON->bstr; print "//// example: Pi"; my $pi1 = Math::BigFloat->bpi(45); print " orig: $pi1"; my $fpi = number2fixed($pi1); print "fixed: ", int128_to_hex($fpi); my $pi2 = fixed2number($fpi); print " conv: $pi2"; # https://www.youtube.com/watch?v=npQF28g6s_k sub mul_fixed { # 2x Math::Int128 --> Math::Int128 return (shift>>($SCALE_BITS/2)) * (shift>>($SCALE_BITS/2)); } print "//// properties of mul_fixed"; print "smallest possible multiplicand: ", fixed2number(int128(1)<<($SCALE_BITS/2)); # demo effect of the smallest possible multiplicand my $mul = number2fixed(Math::BigInt->new(3)); my $toosmall = number2fixed( Math::BigFloat->new("0.000000000000000000216")); print " too small: ", int128_to_hex($toosmall); print "\t* 3 = ", fixed2number(mul_fixed($mul,$toosmall)); my $justenough = number2fixed( Math::BigFloat->new("0.000000000000000000217") ); print "just enough: ", int128_to_hex($justenough); print "\t* 3 = ", fixed2number(mul_fixed($mul,$justenough)); print "//// test case: (45e-10)**2"; my $val = Math::BigFloat->new("0.0000000045"); print " input value: ", $val->bstr; my $fval = number2fixed($val); print " as fixed-pt: ", int128_to_hex($fval); my $p = mul_fixed($fval,$fval); print "fixed-pt mul: ", int128_to_hex($p); print " converted: ", fixed2number($p); print " expected: ", $val->copy->bpow(2);

      Output:

      Scale: 2**124 = 21267647932558653966460912964485513216 Epsilon: 0.00000000000000000000000000000000000004701977403289150031874 +946148888982711275 //// example: Pi orig: 3.14159265358979323846264338327950288419716940 fixed: 3243F6A8885A308D313198A2E0370734 conv: 3.141592653589793238462643383279502884184 //// properties of mul_fixed smallest possible multiplicand: 0.000000000000000000216840434497100886 +8014905601739883422852 too small: 00000000000000003FC07FA1F14C2C48 * 3 = 0 just enough: 0000000000000000400C0E7219869986 * 3 = 0.0000000000000000006505213034913026604044716805219650268555 //// test case: (45e-10)**2 input value: 0.0000000045 as fixed-pt: 00000001353CD652BB1674942F2A17B4 fixed-pt mul: 000000000000001758BEBD8531A08964 converted: 0.000000000000000020249999998198227269026294503547169656 +24 expected: 0.00000000000000002025

      In other words: You've got a little better than 36 decimal places after the point for storage, addition and subtraction. In multiplication, as implemented here, the inputs are currently limited to a bit more than 17 decimal places.

        Thanks for that Anony. I skipped through the videos earlier, (clever guy, but he does have a tendency to ramble a bit :), and your code clarifies stuff that didn't go in from the first viewing.

        Whatever happens, I'm going to have to get to grips with a basic 128-bit integer type, so that'll be my first focus. Then I'll come back to this once I have that working.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
        In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked