The basic process at work here is the following:
    printf "%.17e\n", 1.00000000000000040e+000;;
    1.00000000000000040e+000
    printf "%.17e\n", 5.00000000000100030e-016;;
    5.00000000000100030e-016
    printf "%.17e\n", $S = 1.00000000000000040e+000 + 5.00000000000100030e-016;;
    1.00000000000000090e+000
Subtracting the two numbers in turn from their sum ought, mathematically, to give zero; but because of the machine's limited precision, a delta falls out, and that is how the lost precision is recovered:
    printf "%.17e\n", $S = 1.00000000000000040e+000 + 5.00000000000100030e-016;;
    1.00000000000000090e+000
    printf "%.17e\n", $e = ( $S = 1.00000000000000040e+000 + 5.00000000000100030e-016 ) - 1.00000000000000040e+000;;
    4.44089209850062620e-016
    printf "%.17e\n", ( $e = ( $S = 1.00000000000000040e+000 + 5.00000000000100030e-016 ) - 1.00000000000000040e+000 ) - 5.00000000000100030e-016;;
    -5.59107901500374110e-017
That last value, with its sign flipped, is the precision lost in the addition; it is what needs to be stored in the second double.
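For anyone who wants to try the same steps in C, here is a minimal sketch. The function name fast_two_sum is made up for this illustration and is not taken from the C++ under discussion; it assumes |a| >= |b|, round-to-nearest doubles, and a compiler that does not reassociate floating-point expressions (so no -ffast-math). It computes the error as b - (S - a), so it has the opposite sign to the last value printed above.

```c
#include <stdio.h>

/* Hypothetical helper: *s receives the rounded sum a + b, and *err the
 * part of b that was rounded away, so that a + b == *s + *err exactly.
 * Assumes |a| >= |b|. */
static void fast_two_sum(double a, double b, double *s, double *err)
{
    /* volatile forces each intermediate to be stored as a plain double */
    volatile double sum  = a + b;    /* the rounded sum, $S above        */
    volatile double seen = sum - a;  /* how much of b made it into sum   */
    *s   = sum;
    *err = b - seen;                 /* how much of b did not            */
}

/* Prints the sum and the recovered error for the two values used above. */
static void show_example(void)
{
    double s, err;
    fast_two_sum(1.00000000000000040e+000, 5.00000000000100030e-016, &s, &err);
    printf("%.17e\n%.17e\n", s, err);
}
```

Calling show_example() from main should print the same sum as the Perl session, followed by the recovered error, which is the value worth keeping in the low double.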
Of course, it doesn't end there. There might already be values in those other (low-order) doubles; and they need to be added together along with the spillage from the high-order doubles above.
But that calculation itself can result in what might be termed 'overflow' or 'carry-over'; and that needs to be added back into the high order part of the result. But that ...
Hopefully, you get the picture.
The sequences of additions and subtractions in that sub are meant to ensure that borrows from the high-order doubles and carries from the low-order doubles are sorted out and merged, with the result that you end up with 105 or 106 bits of precision.
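To make the carry and borrow shuffling concrete, here is a minimal C sketch of one commonly published way to add two double-double values. It is an illustration under assumptions, not the sub or the C++ discussed in this thread: the type dd and the names two_sum, quick_two_sum and dd_add are all hypothetical, and it again assumes round-to-nearest doubles with no reassociation by the compiler.

```c
/* Hypothetical double-double type: value = hi + lo, with |lo| much smaller than |hi|. */
typedef struct { double hi, lo; } dd;

/* Rounded sum in .hi, rounding error in .lo; works for any ordering of a and b. */
static dd two_sum(double a, double b)
{
    volatile double s  = a + b;
    volatile double bb = s - a;         /* the part of b seen by s       */
    volatile double ea = a - (s - bb);  /* what was lost from a          */
    volatile double eb = b - bb;        /* what was lost from b          */
    dd r = { s, ea + eb };
    return r;
}

/* Cheaper variant, intended for |a| >= |b|; used here to renormalise. */
static dd quick_two_sum(double a, double b)
{
    volatile double s = a + b;
    volatile double t = s - a;
    dd r = { s, b - t };
    return r;
}

static dd dd_add(dd x, dd y)
{
    dd s = two_sum(x.hi, y.hi);     /* high parts, plus their spillage      */
    dd t = two_sum(x.lo, y.lo);     /* low parts, plus their own spillage   */
    s.lo += t.hi;                   /* merge the low sum into the spillage  */
    s = quick_two_sum(s.hi, s.lo);  /* carry back up into the high part     */
    s.lo += t.lo;                   /* fold in what the low sum lost        */
    return quick_two_sum(s.hi, s.lo); /* final renormalisation              */
}
```

Adding {1.0, 0.0} and {1e-20, 0.0} this way keeps the 1e-20 in the low double, and adding {-1.0, 0.0} to the result then brings it back out as the high part, where a plain double addition would have returned zero.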
It (the C++) appears to work, and plenty of people have used it; but I want to port (parts of) it to C, and that's caused me to look closely at it. And there is weirdness afoot.
In reply to Re^4: [OT] C++ mystery.
by BrowserUk
in thread [OT] C++ mystery.
by BrowserUk