The (unposted) code that I wrote lifted the particular M::I128 multiplication that I benchmarked from 371517 multiplications/s to 927835 multiplications/s. That's a factor of about 2.5, which is a bit less than "3", admittedly. The rest was just my bad memory ... or perhaps deliberate exaggeration ;-)