in reply to Re: Testing methodology
in thread Testing methodology

Firstly, as I already said: thank you for stepping up. I note that the 'big guns' have ducked & covered, presumably keeping their powder dry.

I've spent the best part of yesterday responding to your test suite section by section, and I just discarded most of it because it would be seen as picking on you, rather than targeting the tools I am so critical of.

  1. # Test that Q.pm actually compiles.
    BEGIN { use_ok 'Q' };

    What is the purpose of this test?

    • What happens if the test is included and the Q module is not loadable?

      The Test::* modules trap the fatal error from Perl, and so the test suite continues to run, failing every test.

      Not useful.

    • What happens if we do a simple use Module; instead?

      We get two lines of output instead of six. The lines aren't preceded by comment cards, so my editor does not ignore them. The test run stops immediately rather than running on, testing stuff that cannot possibly pass, or that is in error if it does.

      Useful. (A minimal sketch contrasting the two styles follows this list.)

    • What is actually being tested here?

      That perl can load a module? If it couldn't, the Test::* tools wouldn't load.

      Not useful.

      That the tarball unpacked correctly? Perl would tell me that just as reliably.

      No more useful than Perl.

      That the module installed correctly? No. Because when the test suite is run, the module isn't installed. It is still in the blib structure.

      Not useful.

    • And why does it force me to put it in a BEGIN{}? Because without it, I'd have to use parens on all my method calls, otherwise they may be taken as filehandles or barewords.

      Worse than non-useful. Detrimental. Extra work because of changed behaviour.

  2. # new_ok
    my $q = new_ok Q => [5];

    This apparently tests whether the returned object is of the same class as the name supplied for the class. Why?

    That prevents me from doing:

    package Thing;
    use if $^O eq 'MSWin32', 'Thing::Win32';
    use if $^O ne 'MSWin32', 'Thing::nix';
    sub new {
        return $^O eq 'MSWin32' ? Thing::Win32->new() : Thing::nix->new();
    }

    Detrimental. Extra work; limits options.

  3. # Test that the API as documented still exists.
    can_ok $q => 'nq';
    can_ok $q => 'dq';
    can_ok $q => 'n';
    • What do we get if we use this and it fails?
      not ok 1 - async::Q->can('pq')
      # Failed test 'async::Q->can('pq')'
      # at -e line 1.
      # async::Q->can('pq') failed

      Four lines, three of which just repeat the same thing in different words. And the tests continue, despite the fact that any test that uses that method will fail.

      No benefit. Verbose. Repetitive.

    • And if we let Perl detect it?
      Can't locate object method "pq" via package "async::Q" at -e line 1.

      One line, no comment card. No repetition.

    Pointless extra work for no benefit.

  4. The rest elided.
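
For illustration only, a minimal sketch of the two loading styles that item 1 contrasts -- 'Q' simply standing in for whichever module the suite exercises:

    # Test::More style: if Q fails to load, the failure is trapped and
    # reported as a failed test, and every later test in the file still
    # runs (and fails), wrapped in '#'-prefixed diagnostics.
    use Test::More tests => 1;
    BEGIN { use_ok( 'Q' ) }

    # Plain-perl style preferred above: the same failure is a fatal
    # compile-time error, carries no comment prefix, and stops the run
    # on the spot.
    # use Q;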

Again, thank you for being a willing subject. Now's your chance for revenge :) Take it!

Here is my module complete with its test suite:

#! perl -slw
use strict;

package async::Q;

use async::Util;
use threads;
use threads::shared;

use constant {
    NEXT_WRITE => -2,
    N          => -1,
};

sub new {
    # twarn "new: @_\n";
    my( $class, $Qsize ) = @_;
    $Qsize //= 3;
    my @Q :shared;
    $#Q = $Qsize;
    @Q[ NEXT_WRITE, N ] = ( 0, 0 );
    return bless \@Q, $class;
}

sub nq {
    # twarn "nq: @_\n";
    my $self = shift;
    lock @$self;
    for( @_ ) {
        cond_wait @$self until $self->[ N ] < ( @$self - 2 );
        $self->[ $self->[ NEXT_WRITE ]++ ] = $_;
        ++$self->[ N ];
        $self->[ NEXT_WRITE ] %= ( @$self - 2 );
        cond_signal @$self;
    }
}

sub dq {
    # twarn "dq: @_\n";
    my $self = shift;
    lock @$self;
    cond_wait @$self until $self->[ N ] > 0;
    my $p = $self->[ NEXT_WRITE ] - $self->[ N ]--;
    $p += @$self - 2 if $p < 0;
    my $out = $self->[ $p ];
    cond_signal @$self;
    return $out;
}

sub n {
    # twarn "n: @_\n";
    my $self = shift;
    lock @$self;
    return $self->[ N ];
}

sub _state {
    # twarn "_state: @_\n";
    no warnings;
    my $self = shift;
    lock @$self;
    return join '|', @{ $self };
}

return 1 if caller;

package main;

use strict;
use warnings;
use threads ( stack_size => 4096 );
use threads::shared;
use async::Util;
use Time::HiRes qw[ time sleep ];

our $SIZE //= 10;
our $N    //= 1e5;
our $T    //= 4;
++$T; $T &= ~1;

my $Q1_n = new async::Q( $SIZE );
my $Qn_n = new async::Q( $SIZE );
my $Qn_1 = new async::Q( $SIZE );

my @t1 = map async( sub{
    $Qn_n->nq( $_ ) while defined( $_ = $Q1_n->dq );
} ), 1 .. $T/2;

my @t2 = map async( sub{
    $Qn_1->nq( $_ ) while defined( $_ = $Qn_n->dq );
} ), 1 .. $T/2;

my $bits :shared = chr(0);
$bits x= $N / 8 + 1;

my $t = async{
    while( defined( $_ = $Qn_1->dq ) ) {
        die "value duplicated" if vec( $bits, $_, 1 );
        vec( $bits, $_, 1 ) = 1;
    }
};

my $start = time;

$Q1_n->nq( $_ ) for 1 .. $N;

$Q1_n->nq( (undef) x ($T/2) );
$_->join for @t1;

$Qn_n->nq( (undef) x ($T/2) );
$_->join for @t2;

$Qn_1->nq( undef );
$_->join for $t;

my $stop = time;

my $b = unpack '%32b*', $bits;
die "NOK: $b : \n" . $Q1_n->_state, $/, $Qn_n->_state, $/, $Qn_1->_state
    unless $b == $N;

printf "$N items by $T threads via three Qs size $SIZE in %.6f seconds\n",
    $stop - $start;

__END__
C:\test>perl async\Q.pm -N=1e4 -T=2 -SIZE=40
1e4 items by 2 threads via three Qs size 40 in 5.768000 seconds

C:\test>perl async\Q.pm -N=1e4 -T=20 -SIZE=40
1e4 items by 20 threads via three Qs size 40 in 7.550000 seconds

C:\test>perl async\Q.pm -N=1e4 -T=200 -SIZE=400
1e4 items by 200 threads via three Qs size 400 in 8.310000 seconds

You'll notice that in addition to performing a default test, it can be configured through command line parameters to vary the key parameters of the test.

The actual test consists of setting up three queues. One thread feeds data via the first queue to a pool of threads (1 to many). That pool dequeues the input and passes it on to a second pool of threads via the second queue (many to many). Finally, those threads pass the data back to the main thread via the third queue (many to 1).

The data for a run consists of a simple list of integers. Once they make it back to the main thread, they are checked off against a bitmap tally to ensure that nothing is dequeued twice or omitted.

All in one file; no extraneous modules; no extraneous output; completely compatible with any other test tools available, because it is nothing more than a simple perl script.

Feel free to rip it to shreds.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Re^3: Testing methodology
by tobyink (Canon) on Mar 06, 2012 at 10:17 UTC

    What happens if the test is included and the Q module is not loadable?

    Depends why loading fails. If Q.pm parses OK and executes OK but returns false, then the use_ok test will fail but the other twenty tests will still pass.

    None of your tests cover the situation where Q.pm returns false because you never attempt to "use Q" or "require Q".

    Nothing compels you to put a BEGIN { ... } block around it, but as a matter of style (in both test suites and regular code) I tend to make sure all modules load at compile time unless I'm purposefully deferring the load for a specific reason.

    This apparently tests whether the return object is of the same class as the name supplied for the class. Why?

    No, it doesn't. It checks that the returned object is of the same class as the name supplied, or a descendant class in an inheritance hierarchy. This still allows you to return Q::Win32 or Q::Nix objects depending on the current OS, provided that they both inherit from Q.

    To have a class method called "new" in Q, which returns something other than an object that "isa" Q would be bizarre and likely to confuse users. Bizarreness is worth testing against. Tests don't just have to catch bugs - they can catch bad ideas.

    But it can catch bugs anyway. Imagine Q/Win32.pm contains:

    my @ISA = ('Q');

    Ooops! That should be our @ISA. This test catches that bug.
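
    To make that concrete, here is a hedged, self-contained sketch -- hypothetical Q and Q::Win32 packages, not the module from the parent post -- showing how the new_ok test trips over the lexical @ISA:

    use strict;
    use warnings;

    package Q::Win32;                 # hypothetical OS-specific backend
    my @ISA = ('Q');                  # bug: lexical, so no inheritance is actually set up
    sub new { my( $class, $size ) = @_; return bless { size => $size }, $class }

    package Q;                        # hypothetical front end dispatching by OS
    sub new { my( $class, $size ) = @_; return Q::Win32->new( $size ) }

    package main;
    use Test::More tests => 1;
    new_ok 'Q' => [5];                # fails: the returned Q::Win32 object is not a Q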

    can_ok

    Notice none of my further tests touch the "n" method? Well, at least its existence is tested for. If for some reason during a refactor it got renamed, this test would fail, and remind me to update the documentation.

    I don't think any of your tests check the "n" method either. If you accidentally removed it during a refactor, end users might get a nasty surprise.

    A can_ok test is essentially a pledge that you're not going to remove a method, or not without some deliberate decision process.
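
    (For what it's worth, Test::More's can_ok also accepts a list of method names, so the three separate checks quoted above could be collapsed into a single test; a minimal sketch, assuming the same $q object:)

    can_ok $q, qw( nq dq n );    # one test covering all three documented methods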

    Use of a formalised testing framework can act as a contract - not necessarily in the legal sense - between the developer and the end users. It's a statement of intention: this is how my software is expected to work; if you're relying on behaviour that's not tested here, then you're on dangerous ground; if I deviate from this behaviour in future versions, it will only have been after careful consideration, and hopefully documentation of the change.

    ☆ ☆ ☆

    Overall most of your complaints around Test::More seem to revolve around three core concerns:

    1. Verbosity of output;
    2. That it continues after a failure has been detected rather than bailing out; and
    3. That it apparently "forces you to jump through hoops".

    Verbosity of output has never been an issue for me. The "prove" command (bundled with Perl since 5.8.x) gives you control over the granularity of result reporting: one line per test, one line per file, or just a summary for the whole test suite.

    Yes, you get more lines when a test fails, but as a general rule most of your tests should not be failing, and when they do, you typically want to be made aware of it as loudly as possible.

    The fact that test running continues after a failure I regard as a useful feature. Some test files are computationally expensive to run. If lots of calculations occur and then a minor test of limited importance fails, I still want to be able to see the results of the tests following it, so that if there are any more failures I can fix them all before re-running the expensive test file.

    If a particular test is so vital that you think the test file should bail out when it fails, it's not especially difficult to append an "or BAIL_OUT($reason)" to the end of the test:

    my $q = new_ok Q => [5] or BAIL_OUT("too awful");

    Test::Most offers the facility to make all tests bail out on failure, but I've never really used Test::Most.

    One man's "forced to jump through hoops" is another man's "saved from writing repetitive code".

    new_ok saves me from writing:

    my $q = Q->new(5);
    unless (blessed $q and $q->isa('Q')) {
        warn "new did not return an object which isa Q";
        # and note that the line number reported by "warn" here
        # is actually two lines *after* the real error occurred.
    }

    Ultimately if I did ever feel like a particular set of tests wasn't a natural fit for Test::More, there would be nothing to stop me sticking a few non-TAP scripts into my distro's "t" directory, provided I didn't name them with a ".t" at the end. They can live in the same directory structure as my other tests; they just won't get run by "prove" or "make test", and won't be reported on by CPAN testers. It doesn't have to be an either/or situation.

      Okay, you're a Kool-Aid drinker. S'cool.

      I'm not. I don't like the taste.

      The only thing I'll respond to is:

      I don't think any of your tests check the "n" method either.

      True, for a reason: I cannot think of a good use for it. As such it may well go away if I don't find a use for it between now and releasing it. If I ever do.

      The only use I make of Thread::Queue::pending() relates to preventing the Q from attaining an unbounded size. My queue addresses that internally, so that use goes away.

      Another possible use would be to prevent code from calling dq() when it would block. But as discussed elsewhere, that use is a bust because the information can be out-of-date by the time I get it.

      If there was a use-case for a dq_nb() or similar, then it would have to be implemented internally -- with the test under locks. If I find myself wanting that facility then I'll add it (under some name) and probably drop n().



Re^3: Testing methodology (TAP++)
by tye (Sage) on Mar 06, 2012 at 05:26 UTC

    Yeah, I eventually realized that there was almost nothing in Test::More that I was actually using or even found to be a wise thing to use. isa_ok()? I just can't imagine that finding a real mistake. I can certainly see it complaining about an implementation detail that I might change.

    I also don't want to use is_deeply(). I think a test suite should complain about what it cares about. If there is extra stuff that it doesn't care about, then it shouldn't complain.

    And I find Test::Simple to just be a stupid idea.

    But I do use and value Test (though I use my own wrapper around it for a few minor reasons and to provide decent Lives() and Dies() implementations -- better than the several modules that purport to provide such that I've seen). I certainly make frequent use of skip()ing.

    The fact that Test doesn't use Test::Builder is a minor side benefit that becomes a major benefit every so often when I feel the need to look at the source code. Test::More's skip() is so bizarrely defined that I can't actually use it correctly without reading the code that implements it, and the act of trying to find said code is so terribly aggravating (since Test::Builder is involved) that I'm happy to have realized that I never have to go through that again.

    There are tons of tools built on top of TAP (and other testing schemes such as used by some of our Ruby-based tests). It is actually useful in the larger context for each individual test to get numbered so we can often correlate different failure scenarios and to make concise reports easy.

    And we have more than one test file per code file in many cases. This is especially useful when there are interesting set-up steps required for some tests. Testing leaf modules is the easiest case and usually doesn't really stress one's testing chops.

    Many of my test files abstract a few patterns of test and then run lots of simple tests that are specified with a small amount of data. So, for example, I might have a few dozen lines where each line specifies an expected return value, a method name, and an argument list (and maybe a test description).
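
    (Purely as illustration of that pattern, here is a hedged sketch with a tiny inline Toy::Stack standing in for the real module under test, since none of the actual modules are shown here:)

    use strict;
    use warnings;
    use Test::More;

    {   # stand-in for the module under test, defined inline so the sketch runs on its own
        package Toy::Stack;
        sub new  { return bless { items => [] }, shift }
        sub add  { my $self = shift; push @{ $self->{items} }, @_; return scalar @{ $self->{items} } }
        sub take { my $self = shift; return pop @{ $self->{items} } }
        sub size { return scalar @{ $_[0]{items} } }
    }

    my $obj = Toy::Stack->new;

    # expected value, method name, argument list, test description
    my @cases = (
        [ 2,  'add',  [ 10, 20 ], 'add returns the new element count' ],
        [ 20, 'take', [],         'take returns the most recent item' ],
        [ 1,  'size', [],         'one item remains'                  ],
    );

    for my $case ( @cases ) {
        my( $want, $method, $args, $desc ) = @$case;
        is( $obj->$method( @$args ), $want, $desc );
    }

    done_testing();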

    Also, having the test code in the same file as the code being tested would complicate coverage measurement, easily distinguishing commits that are fixing code from commits that are fixing tests, searching for real uses of a specific feature while ignoring tests that make use of it, ...

    But, no, I'm not interested in "stepping up" to your challenge. Many of my reasons would just come across as a personal attack so I'll not go into them. But most of what I'm talking about I can't demonstrate well by pasting a bit of code. I have no interest in trying such a feat.

    - tye        

      There are tons of tools built on top of TAP.

      I just don't get what people get from TAP.

      As a (module/application) user, I don't give a monkeys what passed or failed. Either it passed or it didn't. Nor do I (as a Win32 user) give a flying fig for whether you skipped a thousand tests because I'm not on *nix.

      As a (module/application) programmer, if 90% passed is acceptable, then 10% of the tests are useless.

      If I wrapped ok() around my 'has this value been dequeued before' test, I'd be producing 100,000 (or a 1,000,000 or 100,000,000) OKs.

      Even if the user has configured a tool to suppress or summarise that useless information, it still means 100,000 (...) calls to a function to produce useless output; and 100,000 ( ... ) IOs to the screen or pipe, and 100,000 ( ... ) checks in the harness to throw away what I don't want to start with. My testing therefore takes 10 times as long for no benefit.

      "Why do you care about the performance of tests?", I can hear some somebodies asking -- especially as I dissed their time/CPU usage statistics. But the problem is that IO goes through the kernel and is (often) serialised. And that completely screws with the statistical legitimacy of my testing strategy.

      I have at least half a dozen different implementations of a bounded Q. Some are pure Perl, like this one. Some (in XS) bypass Perl's Win32 emulation of the *nix cond_* calls and use (Win32) kernel locking and synching constructs directly. Some (in C/assembler) bypass even those and implement locking using CPU primitives.

      Many of them are, or have been at some points, incorrectly coded and will deadlock or live lock. But in almost every case when that happens, if I introduce a few printf()'s into the key routines, they perform perfectly. Until I remove them again or (for example) redirect that trace output to NULL. And then they lock again.

      The reason is that the multi-threaded C runtime performs its own internal locking to prevent it from corrupting its own internal structures. And those locks can and do prevent the timing conditions that cause the hangs.

      So, for me at least, not only do I not see any benefit in what TAP does, the output it requires can completely corrupt my testing.

      It is actually useful in the larger context for each individual test to get numbered so we can often correlate different failure scenarios and to make concise reports easy.

      As the developer receiving an error report, the first thing I'm going to want to do is convert the 'test number' to the file/line number. Why bother producing test numbers in the first place? Just give the user file & line and have him give that back to me.

      The only plausible benefit would be if the test number were somehow unique. That is, if the number of the test didn't change when new tests were added or old ones were removed. Then I might be able to respond to reports from old versions. But that isn't the case.

      And we have more than one test file per code file in many cases. This is especially useful when there are interesting set-up steps required for some tests.

      Hm. Unit tests, test the unit. System, integration and regression tests are different and live in a different place.

      I'm having a hard time envisaging the requirement for "interesting set-ups" for unit testing.

      Many of my test files abstract a few patterns of test and then run lots of simple tests that are specified with a small amount of data.

      Isn't that exactly what my 'has this value been dequeued before' test is doing? (I re-read the para many times and I'm still unsure what you mean?)

      my $bits :shared = chr(0);
      $bits x= $N / 8 + 1;

      my $t = async{
          while( defined( $_ = $Qn_1->dq ) ) {
              die "value duplicated" if vec( $bits, $_, 1 );
              vec( $bits, $_, 1 ) = 1;
          }
      };

      I see no benefit at all in counting those as individual tests. Much less in allowing the test suite to continue so that the one failure gets lost in the flood of 99,999:

      D'ok 1 - got 1 from queue
      D'ok 2 - got 2 from queue
      D'ok 3 - got 3 from queue
      D'ok 4 - got 4 from queue
      D'ok 5 - got 5 from queue
      D'ok 6 - got 6 from queue
      D'ok 7 - got 7 from queue
      D'ok 8 - got 8 from queue
      D'ok 9 - got 9 from queue
      ...
      D'ok 99996 - got 99996 from queue
      D'ok 99997 - got 99997 from queue
      D'ok 99998 - got 99998 from queue
      D'ok 99999 - got 99999 from queue
      D'ok 100000 - got 100000 from queue

      (D'oh! Preachin' agin. Sorry! :)

      Also, having the test code in the same file as the code being tested would complicate coverage measurement,

      Maybe legit on a large collaborative project. But I still maintain that if I need a tool to verify my coverage, the module is too damn big.

      Update: split this quote out from the previous one; and responded separately

      easily distinguishing commits that are fixing code from commits that are fixing tests, searching for real uses of a specific feature while ignoring tests that make use of it, ...

      And I do not see the distinction here either. Test code is code. You have to write it, test it and maintain it. The bug fix that fixed the incorrectly coded test that was reporting spurious errors is just as legitimate and important as the one that fixed the code under test that was reporting legitimate errors. Treating them in some way (actually, any way) differently is nonsense.

      And this (dare I say it?) is my biggest problem with TDD: "The Franchise". It actively encourages and rewards the writing of reams and reams of non-production code. And in most cases, it does not factor that code into the costs and value of the production product.

      Try explaining to your National Project Coordinator (due in Parliament the following week to explain to the Prime Minister why the project is late and over budget) that the reason everything worked during in-house testing, then went belly-up on the first day of the high-profile, closely monitored, €18 million pilot study, was that all the in-house tests had been run with debug-logging enabled. That logging so completely distorted the timing that nobody believed you when you warned that, in critical areas, the overzealous use of over-engineered OO techniques meant there was no way it could keep up with full production-scale loading. The logging was effectively serialising the inbound state changes, so nothing broke.

      But, no, I'm not interested in "stepping up" to your challenge.

      From what you've said about (at least some) of the test tools I'm critiquing, you would not have been the right 'big gun' for my purpose anyway.

      Many of my reasons would just come across as a personal attack so I'll not go into them.

      That is a shame. (For me!)

      I don't feel that I respond 'hurt' to critiques of my code. I may argue with conclusions and interpretations, but (I like to think) only because of my disagreement with your technical assessment of that code.

      But when you start pseudo-psychoanalysing me on the basis of my code -- or words -- and start attributing their deficiencies (as you see them) to some personality trait indicative of some inherited mental condition, rather than to typos, misunderstandings or, dog forbid, mistakes, I will take umbrage and will respond in kind.

      This is where we have always clashed.

      But most of what I'm talking about I can't demonstrate well by pasting a bit of code. I have no interest in trying such a feat.

      And, as is so often the case, the most interesting part of your response leaves me with a million questions and wanting more...



        If I wrapped ok() around my 'has this value been dequeued before' test, I'd be producing 100,000 (or a 1,000,000 or 100,000,000) OKs.

        What a stupid idea. And not one that I saw anybody suggest.

        Hm. Unit tests, test the unit. System, integration and regression tests are different and live in a different place.

        I'm having a hard time envisaging the requirement for "interesting set-ups" for unit testing.

        Well, I guess you haven't done much interesting unit testing? As I said, testing leaf modules is relatively trivial. Unit testing non-leaf modules can get tricky and there can be interesting set-up required so you can mock out things that the non-leaf module depends on so that you test the unit, not the whole system it employs.

        But even something as relatively trivial as a leaf module like File::FindLib required several interesting set-ups that would be a huge pain to do from a single file. So I'm not particularly moved by your failure of imagination on that point.

        Many of my test files abstract a few patterns of test and then run lots of simple tests that are specified with a small amount of data.
        Isn't that exactly what my 'has this value been dequeued before' test is doing? (I re-read the para many times and I'm still unsure what you mean?)

        Perhaps you should have moved on to the next sentence? "So, for example, I might have a few dozen lines where each line specifies an expected return value, a method name, and an argument list (and maybe a test description)." That is so very much not "call the same method 10,000 times expecting the same result each time, reporting 'ok' separately for each call". I don't see how one can confuse the two so I won't waste time trying to restate that.

        But then, I don't really consider what you keep talking about as a unit test. It is a functional test that has no reproducibility and relies on external interruptions in hopes of randomly inducing a problem. Yes, if I had to write a thread queue module, it is a test I would run but it would not be in the main "unit tests".

        The unit tests would cover the parts of the unit that can be tested in a controlled manner. Does trying to dequeue from an empty queue block? Does trying to enqueue to a full queue block? If I enqueue two items, do they dequeue in the expected order? They'd catch those common off-by-one errors, for example.
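
        Purely by way of illustration, a hedged sketch of controlled tests of that sort, using core Thread::Queue as a stand-in so it runs on its own:

        use strict;
        use warnings;
        use threads;                 # load before Test::More so its counters are thread-aware
        use Thread::Queue;
        use Time::HiRes qw( sleep );
        use Test::More tests => 4;

        my $q = Thread::Queue->new();

        # FIFO ordering: two items come back in the order they went in.
        $q->enqueue( 'first', 'second' );
        is( $q->dequeue, 'first',  'items dequeue in FIFO order (1st)' );
        is( $q->dequeue, 'second', 'items dequeue in FIFO order (2nd)' );

        # Blocking behaviour: a dequeue on an empty queue should not return
        # until something has been enqueued.
        my $thr = threads->create( sub { $q->dequeue } );
        sleep 0.25;                  # give the worker a chance to block
        ok( $thr->is_running, 'dequeue blocks on an empty queue' );

        $q->enqueue( 42 );           # release the blocked worker
        is( $thr->join, 42, 'the blocked dequeue returns the item once one arrives' );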

        Part of the point of the unit tests is that they get run nightly and whenever a commit is pushed and before a revision gets passed to QA. A test that requires a CD be played so we get some real hardware interrupts isn't usually part of that mix.

        And, yes, I have written thread queues and written tests for them. And, in trying to test for the interesting failures particular to such code, I wrote tests similar to what you wrote, tests that more resemble a load test than a unit test. But, in my experience, the load-like tests were pretty useless at finding bugs, even when running lots of other processes to try to add random interruptions to the mix. Running multiple types of load mixes with no failures would not mean that we wouldn't run into bugs in Production. And a bug being introduced was usually more usefully pointed out by the reproducible tests than by the "when I run the load test it fails".

        Unit testing sucks at helping with the interesting failures of things like thread queues. But that also goes back to why I don't use threads much any more. I prefer to use other means that have the benefit of being easier to test reliably.

        But I still maintain that if I need a tool to verify my coverage, the module is too damn big.

        I don't technically need a coverage tool to tell me which parts of one module aren't covered. But it is very convenient. And, yes, it saves a ton of work when dealing with hundreds of modules that a dozen developers are changing every day. Coverage just provides useful reminders about specific lines of code or specific subroutines that got completely missed by the test suite (sometimes developers get rushed, as hard as that is to imagine) and nothing more.

        I rarely have enough time on my hands that I consider reading through hundreds of modules and hundreds of unit tests trying to notice which parts of the former got missed by the latter. And when I'm working on a tiny leaf module, I still figure out "which part did I not test at all yet" by running a command that takes maybe a few seconds to tell me rather than taking a minute or few to swap in every tiny feature of the module and every tiny step tested and perform a set difference in my head.

        Particular bad ideas with coverage include: thinking that 100% (or 99%) coverage really means that you've got good test coverage; shooting for 100% (or 99%) coverage as a goal in itself rather than as a tool for pointing out specific missed spots that either shouldn't be tested or should be carefully considered for how they should be tested; adding a stupid test because it causes a particular line to get 'covered'.

        In response to the update:

        And I do not see the distinction here either. Test code is code. You have to write it, test it and maintain it.

        No, I don't write tests for my test code. I run my test code. That has the side effect of testing that the test code actually runs. If you call that testing the test code, then you must have some confusing conversations.

        And I don't have to maintain code if it isn't being used any longer. But all of my code is used by at least one test file. So, if I don't distinguish, I can't tell that a feature is no longer used (other than by being tested) and so can just be dropped.

        And no change to test code is going to cause a failure in Production nor surprise a customer. So who cares about which changes, and in what ways, is very different between changes to real code and changes to test code.

        - tye