Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I have a strange bug and I don't know how to track it down... Hopefully I am missing something simple *cough* - so please, feel free to ridicule me :)

I have a large XS module that has an extensive test suite. The module itself processes lots of numeric data (doubles), with the data and the results passed between Perl and C via the XS layer. The module targets both Win32 and Linux.

One of the tests has been failing under Linux for a long time. I had always assumed this failure was a simple rounding issue and as such wasn't important (my first mistake!). As more functionality and tests have been added to the module, more of the tests have started to fail under Linux. All these failures are slight variations in the expected numeric results (such as 6.542 rather than 6.5). When the module is built with no optimization (-O0) on Linux, the error goes away. When any other optimization level is used (-O1 through -O3), it always fails.

To confuse things further, regardless of the optimization level, when the failing tests are run under valgrind they pass (with no valgrind warnings or errors).

The same test suite always passes under Windows (MinGW 3.6.x and 4.x) regardless of the optimization level. I have tried several Linux distros, different versions of Perl (5.8.x) and GCC (3.x and 4.x), and played with the fast-math flags, but the results are always the same: with any optimization flag other than -O0 the test suite fails (but always passes when run under valgrind!). As this error is becoming more common, I really need to understand what is causing it and ideally fix the underlying problem (which is likely to be my code).

I am at a loss as to how to track it down and am not sure what to try next. As the module is large, and the datasets even larger, I don't really want to step through a debugger or add lots of print statements - but that is all I can think of.

Replies are listed 'Best First'.
Re: XS optimization bug?
by BrowserUk (Patriarch) on Apr 09, 2009 at 15:17 UTC

    The first step would be to determine whether the value is different before you pass it through the XS interface back to Perl. Pick the smallest test that demonstrates the problem and add code to dump the value(s) to STDERR (or a file) within the C subroutine just prior to returning them.

    I'd suggest dumping the 8-byte doubles in raw hex (or even binary), rather than formatted ASCII, as different implementations of (s)printf can introduce changes that cloud the issue.

    If you can isolate where the issues arise, it may give you a clue as to what causes it.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thanks to all that have commented. The Linux perls weren't built with -Duselongdouble, and the GCC versions are the same (well, the major version numbers are). The same CPU is being used, albeit with the Linux version running under a VM. I have tried other machine/OS/GCC combinations and I get the same results.

      I have started to create smaller test cases. The following is a Perl print-out from the C structure when run under Win32 (regardless of -O level), under Linux at -O0, and under Linux (regardless of -O level) when run via valgrind:

      170 : -25
      171 : 9
      172 : -28.0000000000005
      173 : 25
      174 : -156
      175 : 26.0000000000005
      176 : -50
      177 : 34.0000000000005
      178 : 36
      179 : -10
      180 : 22
      181 : -70

      The following is a Perl print-out from the same structure when run under Linux, compiled with optimization (-O1, -O2 or -O3):

      170 : -25
      171 : 8.99999999999963
      172 : -28.0000000000001
      173 : 25
      174 : -156
      175 : 26.0000000000005
      176 : -50
      177 : 34.0000000000003
      178 : 37.0000000000005
      179 : -10.0000000000001
      180 : 21.9999999999999
      181 : -70.0000000000003

      I would expect to see differences on the order of 0.0000000000003, but if you look at entry 178 you'll notice a much larger difference (37.0000000000005 versus 36), and it's this that is causing the test script to fail. As yet, I don't know what is causing this difference...

Re: XS optimization bug?
by roboticus (Chancellor) on Apr 10, 2009 at 13:08 UTC

    If you're using the same compiler (e.g., gcc), the same optimization settings, and the same CPU architecture on both systems, then I'm surprised.

    Otherwise, you could be tripping over differences in how optimizations are selected/performed. Or you might be running into some of the common problems of numerical methods: rounding error, loss of significance, error propagation, or others I haven't heard about.

    ...roboticus
      If you're using the same compiler (e.g., gcc), the same optimization settings, and the same CPU architecture on both systems, then I'm surprised.

      Yes - my feeling, too. I did wonder whether perhaps the Linux perls had been built with -Duselongdouble (it's certain that the Windows perls weren't). But that's not very likely, and I can't see how the optimisation level would play a part even if the Linux perls *had* been built with -Duselongdouble.

      Still, I thought I'd mention it on the off-chance that it's relevant ....

      Cheers,
      Rob