in reply to Testing: shades of grey

Thank you all for your comments

With regard to the second question, the current verbose output looks like:

ok 12 - expr -a+b ok 13 - expr 1/a ok 14 - expr a.b not ok 15 - expr a/b # TODO # Failed (TODO) test 'expr a/b' # at t/expr-stringify.t line 58. # got: 'a.(1/b)' # expected: 'a/b' ok 16 - expr a.(1/b) ok 17 - expr (1/a).b ok 18 - expr a^2

.. which I cannot trivially eyeball (certainly once there are 100s of tests) to distinguish true FAILs from TODOs. I could live with eyeballable verbose output looking like:

ok 12 - expr -a+b ok 13 - expr 1/a ok 14 - expr a.b not ok 15 - expr a/b # TODO ok 16 - expr a.(1/b) ok 17 - expr (1/a).b ok 18 - expr a^2

.. but would prefer non-verbose prove output that in some way (any way) mentioned "and 1 TODO". I understand that App::Prove is intended to be extensible, so I was hoping somebody would already have done this or have known how to do it, but I guess I'll need to dig in to that.

For the first question, I will have to take some time to absorb all the suggestions. Certainly I will start with [LanX]'s suggestion to put all OK results into one test - maybe I'd actually put them in an array rather than a hash, which gives me a free (if low-quality) ranking. But I need to think more about how to decide "OKness" scores for each of the results, and significance (weighting) scores for one test versus another.

My code only implements the mathematical constructs and the transformation rules that I have needed so far, so I don't think I can usefully test it at this point against other proof checkers, nor usefully verify it against scraped sources of LateX expressions. That said, if Lean been designed to be useful with a command line and text files I probably wouldn't be developing this.

Replies are listed 'Best First'.
Re^2: Testing: shades of grey
by LanX (Saint) on Dec 24, 2024 at 23:56 UTC
    I think you should read this sub thread (again): Re: Testing: shades of grey

    You can

    • bundle test routines into abstract functions
    • fine tune output and behavior with Test::Builder
    • analyse the output with TAP::Parser

    Regarding the weights, I'd start with reasonable defaults unless specified. For instance if you have an order of preferable results, degrade the weight appropriately if better options are missed.

    And I'd group different features in different TODO classes. Like that you can have multi-dimensional quality metrics and are free to combine them into one dimension again.

    All this gives you plenty of possibilities of easy adjustments at a later stage.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery

      Maybe this minimal demo gets you started?

      It's implementing a basic test function "rank" which tests eq and does a linear quality ranking from 1 to 0 against the provided list.

      The output includes a field "quality:"

      You are free to extend this in multiple ways, like

      • providing the internal test function, e.g rank( 'is', ,...)
      • adjusting the ranking curve
      • checking if $TODO is set
      • taking advantage of the todo_output handle to collect metrics separately
      As already said I'd use TAP::Parser to find the quality fields.

      use strict; use warnings; use Test::More; TODO: { local $TODO = "STRINGIFICATION"; rank( 'a-b' => ['a+(-b)', 'a+-b', 'a-b'], "to pass" ); rank( 'a--b' => ['a+(-b)', 'a+-b', 'a-b'], "to fail" ); } sub rank { my ($got, $a_exps, $name) = @_; local $Test::Builder::Level = $Test::Builder::Level + 1; my @exps = @$a_exps; my $pos = 0; my %rank = map { $_ => 1- ($pos++ / @exps) } @exps; my $quality = $rank{$got}; if($quality){ pass($name); diag( sprintf "%12s: %.2f", "quality", $quality ); } else { fail($name); diag( sprintf "%12s: %s", "got","'$got'" ); diag( sprintf "%12s: %s", "expected", join ", ", map {"'$_'"} +@exps ); diag( sprintf "%12s: %.2f", "quality", 0 ); } } done_testing; exit; __END__

      ok 1 - to pass # TODO STRINGIFICATION # quality: 0.33 not ok 2 - to fail # TODO STRINGIFICATION # Failed (TODO) test 'to fail' # at /home/lanx/perl/pm/test_rank.pl line 10. # got: 'a--b' # expected: 'a+(-b)', 'a+-b', 'a-b' # quality: 0.00 1..2

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      see Wikisyntax for the Monastery