hv has asked for the wisdom of the Perl Monks concerning the following question:

I have two questions here, one about what I want to do, the other about what I'm currently doing.

I'm writing tests for an early-stage mathematical proof checker (axiom). In some cases a test has multiple results that would class as "correct", but some results are better than others.

For example, when stringifying a mathematical expression such as a-b, the result 'a(-b)' would be "not ok", but any of 'a+(-b)', 'a+-b', 'a-b' would be "ok" - and with some more ok than others.

Has anybody implemented a test regime that accounts for this sort of thing? The only approach I've thought of so far is some sort of "quality" metric, independent of pass/fail statistics, wherein "more ok" results push the quality metric up. It is not obvious to me how to quantify that though, or how results from individual tests should combine.

Secondly, I'm currently trying to implement these as TODO tests, using Test::More and prove. However when I run a test file individually, a failing TODO test stands out like a true fail; and when I run under prove it just gives output like:

t/expr-stringify.t .. ok
All tests successful.
Files=1, Tests=29,  0 wallclock secs ( 0.03 usr  0.00 sys +  0.31 cusr  0.04 csys =  0.38 CPU)
Result: PASS

.. which gives no indication that there were failing TODO tests.

What is people's preferred way to get a clean indication of "all passing, but with $n failing TODO tests"?

Replies are listed 'Best First'.
Re: Testing: shades of grey
by bliako (Abbot) on Dec 24, 2024 at 10:15 UTC
    'a+(-b)'

    An idea is to punish it (deduct from the metric) when it violates the basic presentation rules we all learned in primary school, e.g. superfluous bracketing, or failing to simplify +-5 to -5. Assign each violation a significance, to be translated into a penalty number. Actually, as it stands, the metric would soon go negative. So, convert your metric to "unfitness" and keep adding the penalties: the smaller the metric, the fitter the solution (aesthetically speaking). There could also be different metrics, like aesthetics, terseness, pedantry, and you combine them, with weights, into a multidimensional overall metric.
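    A minimal sketch of that "unfitness" idea; the two violation checks and their penalty values here are made-up illustrations, not rules from hv's checker:

    use strict;
    use warnings;

    # Hypothetical penalty table: each presentation violation adds to
    # an "unfitness" score; 0 is aesthetically fittest.
    my %penalty = (
        superfluous_brackets => 2,    # e.g. 'a+(-b)' where 'a-b' would do
        unsimplified_signs   => 3,    # e.g. '+-5' instead of '-5'
    );

    sub unfitness {
        my ($expr) = @_;
        my $score = 0;
        $score += $penalty{superfluous_brackets} if $expr =~ /\(\s*-\w+\s*\)/;
        $score += $penalty{unsimplified_signs}   if $expr =~ /\+\s*-\w/;
        return $score;
    }

    # e.g. unfitness('a-b') == 0, unfitness('a+(-b)') == 2, unfitness('a+-b') == 3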

    The other idea is to use other proof checkers and see how your output compares to theirs in terms of presentation of the final result. Unless yours is far superior to anything else out there and excels in each and every metric!

    Lastly, consider this idea as an addition to your project: additionally create a Markov Chain mathematical expression generator trained on freely available mathematical expressions which are assumed to be "perfect" as far as aesthetics are concerned. You will also need a parser, which I guess you already have. I guess mathematical expressions of the best quality would come from parsing LaTeX equations from papers in, say, https://arXiv.org. Which, !!!hail open source!!!, offers the paper source along with PDF and PostScript, e.g. https://arxiv.org/src/2412.17766. So, from my comfortable sofa: a LaTeX equation parser and a Markov-chain mathematical expression generator.
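    For flavour, a toy first-order Markov chain over expression tokens might look like this; the three training token streams are stand-ins for a real corpus of parsed LaTeX:

    use strict;
    use warnings;
    use List::Util qw(sum);

    my %next;    # token => { following-token => count }
    for my $expr ( ['a','+','b'], ['a','-','b'], ['a','+','b'] ) {
        $next{ $expr->[$_] }{ $expr->[$_ + 1] }++ for 0 .. $#$expr - 1;
    }

    # sample a successor token proportionally to how often it was observed
    sub sample_next {
        my ($token) = @_;
        my $counts = $next{$token} or return;
        my $r = rand( sum values %$counts );
        for my $cand ( keys %$counts ) {
            return $cand if ( $r -= $counts->{$cand} ) < 0;
        }
    }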

    Along these lines there is this n-dimensional statistical analysis of DNA sequences (or text, or ...) which resulted in a package I submitted here: https://github.com/hadjiprocopis/Algorithm-Markov-Multiorder-Learner

    bw, bliako

      I'm no expert on Markov chains, but I'm aware of PDL::Ngrams which might help with the Markov algorithm, particularly performance.
Re: Testing: shades of grey
by karlgoethebier (Abbot) on Dec 21, 2024 at 19:20 UTC
Re: Testing: shades of grey
by hv (Prior) on Dec 24, 2024 at 16:48 UTC

    Thank you all for your comments

    With regard to the second question, the current verbose output looks like:

    ok 12 - expr -a+b
    ok 13 - expr 1/a
    ok 14 - expr a.b
    not ok 15 - expr a/b # TODO
    #   Failed (TODO) test 'expr a/b'
    #   at t/expr-stringify.t line 58.
    #          got: 'a.(1/b)'
    #     expected: 'a/b'
    ok 16 - expr a.(1/b)
    ok 17 - expr (1/a).b
    ok 18 - expr a^2

    .. which I cannot trivially eyeball (certainly once there are 100s of tests) to distinguish true FAILs from TODOs. I could live with eyeballable verbose output looking like:

    ok 12 - expr -a+b
    ok 13 - expr 1/a
    ok 14 - expr a.b
    not ok 15 - expr a/b # TODO
    ok 16 - expr a.(1/b)
    ok 17 - expr (1/a).b
    ok 18 - expr a^2

    .. but would prefer non-verbose prove output that in some way (any way) mentioned "and 1 TODO". I understand that App::Prove is intended to be extensible, so I was hoping somebody would already have done this or would know how to do it, but I guess I'll need to dig into that.
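    In the meantime, a standalone TAP::Parser script along these lines might do as a stopgap; the method names are from the TAP::Parser::Result::Test API, though I haven't tested this exact form:

    use strict;
    use warnings;
    use TAP::Parser;

    my $parser = TAP::Parser->new({ source => 't/expr-stringify.t' });
    my ($total, $todo_fail) = (0, 0);
    while ( defined( my $result = $parser->next ) ) {
        next unless $result->is_test;
        $total++;
        # a TODO test that did not literally pass is a "failing TODO"
        $todo_fail++ if $result->has_todo && !$result->is_actual_ok;
    }
    printf "%d tests, %d failing TODO\n", $total, $todo_fail;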

    For the first question, I will have to take some time to absorb all the suggestions. Certainly I will start with [LanX]'s suggestion to put all OK results into one test - maybe I'd actually put them in an array rather than a hash, which gives me a free (if low-quality) ranking. But I need to think more about how to decide "OKness" scores for each of the results, and significance (weighting) scores for one test versus another.

    My code only implements the mathematical constructs and the transformation rules that I have needed so far, so I don't think I can usefully test it at this point against other proof checkers, nor usefully verify it against scraped sources of LaTeX expressions. That said, if Lean had been designed to be useful with a command line and text files I probably wouldn't be developing this.

      I think you should read this sub thread (again): Re: Testing: shades of grey

      You can

      • bundle test routines into abstract functions
      • fine tune output and behavior with Test::Builder
      • analyse the output with TAP::Parser

      Regarding the weights, I'd start with reasonable defaults unless specified. For instance if you have an order of preferable results, degrade the weight appropriately if better options are missed.

      And I'd group different features in different TODO classes. That way you can have multi-dimensional quality metrics and are free to combine them into one dimension again.

      All this gives you plenty of possibilities of easy adjustments at a later stage.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      see Wikisyntax for the Monastery

        Maybe this minimal demo gets you started?

        It implements a basic test function rank() which tests string equality (eq) and assigns a linear quality ranking from 1 down to 0 against the provided list.

        The output includes a field "quality:"

        You are free to extend this in multiple ways, like

        • providing the internal test function, e.g. rank( 'is', ... )
        • adjusting the ranking curve
        • checking if $TODO is set
        • taking advantage of the todo_output handle to collect metrics separately
        As already said I'd use TAP::Parser to find the quality fields.

        use strict;
        use warnings;
        use Test::More;

        TODO: {
            local $TODO = "STRINGIFICATION";

            rank( 'a-b'  => ['a+(-b)', 'a+-b', 'a-b'], "to pass" );
            rank( 'a--b' => ['a+(-b)', 'a+-b', 'a-b'], "to fail" );
        }

        sub rank {
            my ($got, $a_exps, $name) = @_;
            local $Test::Builder::Level = $Test::Builder::Level + 1;

            my @exps = @$a_exps;
            my $pos  = 0;
            my %rank = map { $_ => 1 - ($pos++ / @exps) } @exps;

            my $quality = $rank{$got};
            if ($quality) {
                pass($name);
                diag( sprintf "%12s: %.2f", "quality", $quality );
            }
            else {
                fail($name);
                diag( sprintf "%12s: %s",   "got",      "'$got'" );
                diag( sprintf "%12s: %s",   "expected", join ", ", map { "'$_'" } @exps );
                diag( sprintf "%12s: %.2f", "quality",  0 );
            }
        }

        done_testing;
        exit;

        __END__

        ok 1 - to pass # TODO STRINGIFICATION
        #      quality: 0.33
        not ok 2 - to fail # TODO STRINGIFICATION
        #   Failed (TODO) test 'to fail'
        #   at /home/lanx/perl/pm/test_rank.pl line 10.
        #          got: 'a--b'
        #     expected: 'a+(-b)', 'a+-b', 'a-b'
        #      quality: 0.00
        1..2

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        see Wikisyntax for the Monastery

Re: Testing: shades of grey
by LanX (Saint) on Dec 21, 2024 at 15:02 UTC
    The metric is in the eye of the beholder. ;)

    I'd store all OK stringifications in a hash and fail if it's not included.

    As an idea:

    If you have different levels of nice-to-have, you could try to SKIP / TODO those depending on a global threshold $NICE_LEVEL. Changing it should give you an indication of "more OK". (Disclaimer: never tried this.)

    If you log the level for each block, you can try to calculate a rational number metric afterwards.
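    A rough sketch of what I mean, with made-up levels (again, untried); tests above the global threshold become TODOs instead of hard tests:

    use strict;
    use warnings;
    use Test::More;

    our $NICE_LEVEL = $ENV{NICE_LEVEL} // 1;    # global threshold

    # run $code as a hard test at or below the threshold, as TODO above it
    sub nice_test {
        my ($level, $code) = @_;
        local $Test::Builder::Level = $Test::Builder::Level + 1;
        if ( $level > $NICE_LEVEL ) {
            TODO: {
                local $TODO = "nice-to-have (level $level)";
                $code->();
            }
        }
        else {
            $code->();
        }
    }

    nice_test( 1, sub { is 'a-b',  'a-b', 'expr a-b'        } );
    nice_test( 2, sub { is 'a+-b', 'a-b', 'expr a-b, nicer' } );

    done_testing;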

    Please clarify if I misunderstood your question.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery

    PS: I'm so proud of me to have avoided any pun referencing "50 shades of ..." Oops! ;)

Re: Testing: shades of grey
by etj (Priest) on Dec 22, 2024 at 16:14 UTC
    My understanding of TODO tests is they're all expected to fail, and what's unexpected (and reported on) is for them to pass.

    As for the "how correct" problem, my approach to that situation would be:

    • make a function that evaluates (say on a scale from 0-1) how correct a given expression is, against given criteria
    • make an "ok_ish" function that takes the output of the above, together with a minimum value, and fails if that value drops below a threshold
    That second one would allow spotting regressions in how correct one's parsing was.
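    A minimal sketch of those two functions, assuming a hypothetical score_expr() that rates a string in 0..1 against made-up criteria:

    use strict;
    use warnings;
    use Test::More;

    # hypothetical scorer: start at 1 and deduct for illustrative violations
    sub score_expr {
        my ($expr) = @_;
        my $score = 1;
        $score -= 0.5 if $expr =~ /\(\s*-\w+\s*\)/;    # superfluous brackets
        $score -= 0.3 if $expr =~ /\+\s*-\w/;          # unsimplified '+-'
        return $score < 0 ? 0 : $score;
    }

    # pass only if the score meets the given minimum
    sub ok_ish {
        my ($score, $min, $name) = @_;
        local $Test::Builder::Level = $Test::Builder::Level + 1;
        cmp_ok( $score, '>=', $min, "$name (score $score, minimum $min)" );
    }

    ok_ish( score_expr('a-b'),    0.9, 'plain subtraction' );
    ok_ish( score_expr('a+(-b)'), 0.4, 'bracketed negation' );

    done_testing;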
        My understanding of TODO tests is they're all expected to fail, and what's unexpected (and reported on) is for them to pass.

      I politely disagree with this. The TODO tests are ones that run, and the result is reported, but whether they pass or not is academic. Maybe they passed, and maybe they failed. We don't care -- it's just 'interesting'. :)

      Eventually the stuff that was TODO gets done, and the TODO can be moved into the realm of real tests, and you move on.

      Alex / talexb / Toronto

      For a long time, I had a link in my .sig going to Groklaw. I heard that as of December 2024, this link is dead. Still, thanks to PJ for all your work, we owe you so much. RIP Groklaw -- 2003 to 2013.

        Yes, but in the context of hv's original question, prove does not by default report TODO tests which fail, but it does report TODO tests which pass, albeit without counting those towards the overall pass/fail of the test file/suite. That's the relevant part, AIUI.


        🦛

Re: Testing: shades of grey
by InfiniteSilence (Curate) on Dec 22, 2024 at 06:16 UTC

    "...results are better than others..."

    My guess is that, from a software engineering standpoint, you are interested in this because you want your solution to show a preference based on an eventual 'confidence value': a number between zero and one computed from various input and derived factors.

    Once upon a time this was the core idea behind a branch of computer science that eventually morphed into today's artificial intelligence -- expert system development.

    If interested, there is an online working example of such a system called tmycin (written in Lisp, though) with examples of identifying things like species of snakes or different types of rocks. The user answers a series of prompts while the system computes a confidence value behind the scenes. The end result is a suggestion based on the system's best guess -- a diagnosis or, in your case, the best result from among a list of competitors.

    With some work this could be reimplemented in Perl, or you can check CPAN for a Lisp interpreter. I am not aware of a test module on CPAN that utilizes this technique but would be happy to have one myself if it indeed exists.

    Celebrate Intellectual Diversity

Re: Testing: shades of grey
by LanX (Saint) on Dec 23, 2024 at 23:45 UTC
    Here is another approach:

    Repeating the tests according to their importance will create a metric at the end.

    use Test::More;

    sub weight (&$$) {
        my ($c_test, $name, $weight) = @_;
        my $res = subtest $name, $c_test;
        ok( $res, $name ) for 2 .. $weight;
    }

    weight { is "a", "a" } "PASSING" => 3;
    weight { is "a", "X" } "FAILING" => 2;

    done_testing;

    # Subtest: PASSING
        ok 1
        1..1
    ok 1 - PASSING
    ok 2 - PASSING
    ok 3 - PASSING
    # Subtest: FAILING
        not ok 1
        #   Failed test at /home/lanx/perl/pm/test_metric.pl line 16.
        #          got: 'a'
        #     expected: 'X'
        1..1
        # Looks like you failed 1 test of 1.
    not ok 4 - FAILING
    #   Failed test 'FAILING'
    #   at /home/lanx/perl/pm/test_metric.pl line 6.
    not ok 5 - FAILING
    #   Failed test 'FAILING'
    #   at /home/lanx/perl/pm/test_metric.pl line 8.
    1..5
    # Looks like you failed 2 tests of 5.    <--- Metric

    As you see, you are quite flexible with a bit of functional programming.

    Another approach might be using Test::Builder to create your own semantics.

    update
    For instance you can use $Test->no_diag(1); to make the repeated fails less verbose.

    update

    or you can use my @tests = $Test->details; to get a detailed overview which tests passed and if they were todos.

    # test_metric.pl:33: [
    #   { actual_ok => 1, name => "PASSING", ok => 1, reason => "", type => "" },
    #   { actual_ok => 1, name => "PASSING", ok => 1, reason => "", type => "todo" },
    #   { actual_ok => 1, name => "PASSING", ok => 1, reason => "", type => "todo" },
    #   { actual_ok => 0, name => "FAILING", ok => 0, reason => "", type => "" },
    #   { actual_ok => 0, name => "FAILING", ok => 1, reason => "", type => "todo" },
    # ]

    After putting the weight into the name, you can easily calculate your desired metric.

    NB: ->details also works locally for subtests only.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery

      Finally you can also use TAP::Parser to scan the output.

      So if I were you, I'd use the demonstrated abstraction in combination with Test::Builder to write your own appropriate test semantics and put the "weight" into the comments.

      I would then parse the verbose output with TAP::Parser and calculate the desired "metric" from the weights.

      I wouldn't mess with the classic test result output, because it's a different "all or nothing" metric, separate from the "better" metric.

      So better to put non-crucial tests into TODO blocks, because they are not kill criteria.

      Regarding testing things with various levels of "ok-ness", I'd write a test which tries them from best to worst and bails out as soon as one passes, i.e. reporting OK and logging the weight in the comments.

      Like if-elsif-... chains or, in the case of simple strings, hash look-ups with the weight as values.

      A routine like

      isinchain( $got, [ [$exp1, $weight1, $name1], ... ], $name ) could do the abstraction; see the sketch below.
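      A hedged sketch of such a routine; the signature and the weight-comment format are just one possible choice:

      use strict;
      use warnings;
      use Test::More;

      # check $got against expected variants from best to worst;
      # log the weight of the first match as a TAP comment
      sub isinchain {
          my ($got, $a_chain, $name) = @_;
          local $Test::Builder::Level = $Test::Builder::Level + 1;
          for my $link (@$a_chain) {
              my ($exp, $weight, $linkname) = @$link;
              if ( $got eq $exp ) {
                  pass($name);
                  diag( sprintf "weight: %s (%s)", $weight, $linkname );
                  return 1;
              }
          }
          fail($name);
          diag("got '$got', matching no expected variant");
          return 0;
      }

      isinchain( 'a+-b',
          [ ['a-b', 1.0, 'best'], ['a+-b', 0.6, 'ok'], ['a+(-b)', 0.3, 'ugly'] ],
          'expr a-b' );

      done_testing;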

      That's all very abstract, because your description was very abstract too.

      Hope this helps. :)

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      see Wikisyntax for the Monastery