hv has asked for the wisdom of the Perl Monks concerning the following question:

I have two questions here, one about what I want to do, the other about what I'm currently doing.

I'm writing tests for an early-stage mathematical proof checker (axiom). In some cases a test has multiple results that would class as "correct", but some results are better than others.

For example, when stringifying a mathematical expression such as a-b, the result 'a(-b)' would be "not ok", but any of 'a+(-b)', 'a+-b', 'a-b' would be "ok" - and with some more ok than others.

Has anybody implemented a test regime that accounts for this sort of thing? The only approach I've thought of so far is some sort of "quality" metric, independent of pass/fail statistics, wherein "more ok" results push the quality metric up. It is not obvious to me how to quantify that though, or how results from individual tests should combine.

Secondly, I'm currently trying to implement these as TODO tests, using Test::More and prove. However when I run a test file individually, a failing TODO test stands out like a true fail; and when I run under prove it just gives output like:

t/expr-stringify.t .. ok
All tests successful.
Files=1, Tests=29,  0 wallclock secs ( 0.03 usr  0.00 sys +  0.31 cusr  0.04 csys =  0.38 CPU)
Result: PASS

.. which gives no indication that there were failing TODO tests.

What is people's preferred way to get a clean indication of "all passing, but with $n failing TODO tests"?

Replies are listed 'Best First'.
Re: Testing: shades of grey
by bliako (Abbot) on Dec 24, 2024 at 10:15 UTC
    'a+(-b)'

    An idea is to punish it (deduct from the metric) when it violates the basic presentation rules we all learned in primary school, e.g. superfluous bracketing, or failing to simplify +-5 to -5. Assign each violation a significance, to be translated into a penalty number. Actually, as it stands, the metric would soon go negative. So, convert your metric to "unfitness" and keep adding the penalties: the smaller the metric, the fitter the solution (aesthetically speaking). There could also be different metrics, like aesthetics, terseness, pedantry, and you combine them, with weights, into a multidimensional overall metric.
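    A minimal sketch of that "unfitness" idea; the two violation checks and their penalty values here are made-up illustrations, not rules from hv's checker:

    use strict;
    use warnings;

    # Hypothetical penalty table: each presentation violation adds to
    # an "unfitness" score; 0 is aesthetically fittest.
    my %penalty = (
        superfluous_brackets => 2,    # e.g. 'a+(-b)' where 'a-b' would do
        unsimplified_signs   => 3,    # e.g. '+-5' instead of '-5'
    );

    sub unfitness {
        my ($expr) = @_;
        my $score = 0;
        $score += $penalty{superfluous_brackets} if $expr =~ /\(\s*-\w+\s*\)/;
        $score += $penalty{unsimplified_signs}   if $expr =~ /\+\s*-\w/;
        return $score;
    }

    # e.g. unfitness('a-b') == 0, unfitness('a+(-b)') == 2, unfitness('a+-b') == 3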

    The other idea is to use other proof checkers and see how your output compares to theirs in terms of presentation of the final result. Unless yours is far superior to anything else out there and excels in each and every metric!

    Lastly, consider this idea as an addition to your project: additionally create a Markov Chain mathematical expression generator trained on freely available mathematical expressions which are assumed to be "perfect" as far as aesthetics are concerned. You will also need a parser, which I guess you already have. I guess mathematical expressions of the best quality would come from parsing LaTeX equations from papers in, say, https://arXiv.org. Which, !!!hail open source!!!, offers the paper source along with PDF and PostScript, e.g. https://arxiv.org/src/2412.17766. So, from my comfortable sofa: a LaTeX equation parser and a Markov-chain mathematical expression generator.
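    For flavour, a toy first-order Markov chain over expression tokens might look like this; the three training token streams are stand-ins for a real corpus of parsed LaTeX:

    use strict;
    use warnings;
    use List::Util qw(sum);

    my %next;    # token => { following-token => count }
    for my $expr ( ['a','+','b'], ['a','-','b'], ['a','+','b'] ) {
        $next{ $expr->[$_] }{ $expr->[$_ + 1] }++ for 0 .. $#$expr - 1;
    }

    # sample a successor token proportionally to how often it was observed
    sub sample_next {
        my ($token) = @_;
        my $counts = $next{$token} or return;
        my $r = rand( sum values %$counts );
        for my $cand ( keys %$counts ) {
            return $cand if ( $r -= $counts->{$cand} ) < 0;
        }
    }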

    Along these lines there is this n-dimensional statistical analysis of DNA sequences (or text, or ...) which resulted in a package I submitted here: https://github.com/hadjiprocopis/Algorithm-Markov-Multiorder-Learner

    bw, bliako

      I'm no expert on Markov chains, but I'm aware of PDL::Ngrams which might help with the Markov algorithm, particularly performance.
Re: Testing: shades of grey
by karlgoethebier (Abbot) on Dec 21, 2024 at 19:20 UTC
Re: Testing: shades of grey
by hv (Prior) on Dec 24, 2024 at 16:48 UTC

    Thank you all for your comments

    With regard to the second question, the current verbose output looks like:

    ok 12 - expr -a+b
    ok 13 - expr 1/a
    ok 14 - expr a.b
    not ok 15 - expr a/b # TODO
    #   Failed (TODO) test 'expr a/b'
    #   at t/expr-stringify.t line 58.
    #          got: 'a.(1/b)'
    #     expected: 'a/b'
    ok 16 - expr a.(1/b)
    ok 17 - expr (1/a).b
    ok 18 - expr a^2

    .. which I cannot trivially eyeball (certainly once there are 100s of tests) to distinguish true FAILs from TODOs. I could live with eyeballable verbose output looking like:

    ok 12 - expr -a+b
    ok 13 - expr 1/a
    ok 14 - expr a.b
    not ok 15 - expr a/b # TODO
    ok 16 - expr a.(1/b)
    ok 17 - expr (1/a).b
    ok 18 - expr a^2

    .. but would prefer non-verbose prove output that in some way (any way) mentioned "and 1 TODO". I understand that App::Prove is intended to be extensible, so I was hoping somebody would already have done this or would know how to do it, but I guess I'll need to dig into that.
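    In the meantime, a standalone TAP::Parser script along these lines might do as a stopgap; the method names are from the TAP::Parser::Result::Test API, though I haven't tested this exact form:

    use strict;
    use warnings;
    use TAP::Parser;

    my $parser = TAP::Parser->new({ source => 't/expr-stringify.t' });
    my ($total, $todo_fail) = (0, 0);
    while ( defined( my $result = $parser->next ) ) {
        next unless $result->is_test;
        $total++;
        # a TODO test that did not literally pass is a "failing TODO"
        $todo_fail++ if $result->has_todo && !$result->is_actual_ok;
    }
    printf "%d tests, %d failing TODO\n", $total, $todo_fail;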

    For the first question, I will have to take some time to absorb all the suggestions. Certainly I will start with [LanX]'s suggestion to put all OK results into one test - maybe I'd actually put them in an array rather than a hash, which gives me a free (if low-quality) ranking. But I need to think more about how to decide "OKness" scores for each of the results, and significance (weighting) scores for one test versus another.

    My code only implements the mathematical constructs and the transformation rules that I have needed so far, so I don't think I can usefully test it at this point against other proof checkers, nor usefully verify it against scraped sources of LaTeX expressions. That said, if Lean had been designed to be useful with a command line and text files I probably wouldn't be developing this.

      I think you should read this sub thread (again): Re: Testing: shades of grey

      You can

      • bundle test routines into abstract functions
      • fine tune output and behavior with Test::Builder
      • analyse the output with TAP::Parser

      Regarding the weights, I'd start with reasonable defaults unless specified. For instance if you have an order of preferable results, degrade the weight appropriately if better options are missed.

      And I'd group different features in different TODO classes. That way you can have multi-dimensional quality metrics and are free to combine them into one dimension again.

      All this gives you plenty of possibilities of easy adjustments at a later stage.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      see Wikisyntax for the Monastery

        Maybe this minimal demo gets you started?

        It implements a basic test function rank() which tests string equality (eq) and assigns a linear quality ranking from 1 down to 0 against the provided list.

        The output includes a field "quality:"

        You are free to extend this in multiple ways, like

        • providing the internal test function, e.g. rank( 'is', ... )
        • adjusting the ranking curve
        • checking if $TODO is set
        • taking advantage of the todo_output handle to collect metrics separately
        As already said I'd use TAP::Parser to find the quality fields.

        use strict;
        use warnings;
        use Test::More;

        TODO: {
            local $TODO = "STRINGIFICATION";

            rank( 'a-b'  => ['a+(-b)', 'a+-b', 'a-b'], "to pass" );
            rank( 'a--b' => ['a+(-b)', 'a+-b', 'a-b'], "to fail" );
        }

        sub rank {
            my ($got, $a_exps, $name) = @_;
            local $Test::Builder::Level = $Test::Builder::Level + 1;

            my @exps = @$a_exps;
            my $pos  = 0;
            my %rank = map { $_ => 1 - ($pos++ / @exps) } @exps;

            my $quality = $rank{$got};
            if ($quality) {
                pass($name);
                diag( sprintf "%12s: %.2f", "quality", $quality );
            }
            else {
                fail($name);
                diag( sprintf "%12s: %s",   "got",      "'$got'" );
                diag( sprintf "%12s: %s",   "expected", join ", ", map { "'$_'" } @exps );
                diag( sprintf "%12s: %.2f", "quality",  0 );
            }
        }

        done_testing;
        exit;

        __END__

        ok 1 - to pass # TODO STRINGIFICATION
        #      quality: 0.33
        not ok 2 - to fail # TODO STRINGIFICATION
        #   Failed (TODO) test 'to fail'
        #   at /home/lanx/perl/pm/test_rank.pl line 10.
        #          got: 'a--b'
        #     expected: 'a+(-b)', 'a+-b', 'a-b'
        #      quality: 0.00
        1..2

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        see Wikisyntax for the Monastery

Re: Testing: shades of grey
by LanX (Saint) on Dec 21, 2024 at 15:02 UTC
    The metric is in the eye of the beholder. ;)

    I'd store all OK stringifications in a hash and fail if it's not included.

    As an idea:

    If you have different levels of nice-to-have, you could try to SKIP / TODO those depending on a global threshold $NICE_LEVEL. Changing it should give you an indication of "more OK". (Disclaimer: never tried this.)

    If you log the level for each block, you can try to calculate a rational number metric afterwards.
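    A rough sketch of what I mean, with made-up levels (again, untried); tests above the global threshold become TODOs instead of hard tests:

    use strict;
    use warnings;
    use Test::More;

    our $NICE_LEVEL = $ENV{NICE_LEVEL} // 1;    # global threshold

    # run $code as a hard test at or below the threshold, as TODO above it
    sub nice_test {
        my ($level, $code) = @_;
        local $Test::Builder::Level = $Test::Builder::Level + 1;
        if ( $level > $NICE_LEVEL ) {
            TODO: {
                local $TODO = "nice-to-have (level $level)";
                $code->();
            }
        }
        else {
            $code->();
        }
    }

    nice_test( 1, sub { is 'a-b',  'a-b', 'expr a-b'        } );
    nice_test( 2, sub { is 'a+-b', 'a-b', 'expr a-b, nicer' } );

    done_testing;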

    Please clarify if I misunderstood your question.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery

    PS: I'm so proud of me to have avoided any pun referencing "50 shades of ..." Oops! ;)

Re: Testing: shades of grey
by etj (Priest) on Dec 22, 2024 at 16:14 UTC
    My understanding of TODO tests is they're all expected to fail, and what's unexpected (and reported on) is for them to pass.

    As for the "how correct" problem, my approach to that situation would be:

    • make a function that evaluates (say on a scale from 0-1) how correct a given expression is, against given criteria
    • make an "ok_ish" function that takes the output of the above, together with a minimum value, and fails if that value drops below a threshold
    That second one would allow spotting regressions in how correct one's parsing was.
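    A minimal sketch of those two functions, assuming a hypothetical score_expr() that rates a string in 0..1 against made-up criteria:

    use strict;
    use warnings;
    use Test::More;

    # hypothetical scorer: start at 1 and deduct for illustrative violations
    sub score_expr {
        my ($expr) = @_;
        my $score = 1;
        $score -= 0.5 if $expr =~ /\(\s*-\w+\s*\)/;    # superfluous brackets
        $score -= 0.3 if $expr =~ /\+\s*-\w/;          # unsimplified '+-'
        return $score < 0 ? 0 : $score;
    }

    # pass only if the score meets the given minimum
    sub ok_ish {
        my ($score, $min, $name) = @_;
        local $Test::Builder::Level = $Test::Builder::Level + 1;
        cmp_ok( $score, '>=', $min, "$name (score $score, minimum $min)" );
    }

    ok_ish( score_expr('a-b'),    0.9, 'plain subtraction' );
    ok_ish( score_expr('a+(-b)'), 0.4, 'bracketed negation' );

    done_testing;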
        My understanding of TODO tests is they're all expected to fail, and what's unexpected (and reported on) is for them to pass.

      I politely disagree with this. The TODO tests are ones that run, and the result is reported, but whether they pass or not is academic. Maybe they passed, and maybe they failed. We don't care -- it's just 'interesting'. :)

      Eventually the stuff that was TODO gets done, and the TODO can be moved into the realm of real tests, and you move on.

      Alex / talexb / Toronto

      For a long time, I had a link in my .sig going to Groklaw. I heard that as of December 2024, this link is dead. Still, thanks to PJ for all your work, we owe you so much. RIP Groklaw -- 2003 to 2013.

        Yes, but in the context of hv's original question, prove does not by default report TODO tests which fail, but it does report TODO tests which pass, albeit without counting those towards the overall pass/fail of the test file/suite. That's the relevant part, AIUI.


        🦛

Re: Testing: shades of grey
by InfiniteSilence (Curate) on Dec 22, 2024 at 06:16 UTC

    "...results are better than others..."

    My guess is that, from a software engineering standpoint, you are interested in this because you want your solution to show a preference based on an eventual 'confidence value': a number between zero and one computed from various input and derived factors.

    Once upon a time this was the core idea behind a branch of computer science that eventually morphed into today's artificial intelligence -- expert system development.

    If interested, there is an online working example of such a system called tmycin (written in Lisp, though) with examples of identifying things like species of snakes or different types of rocks. The user answers a series of prompts while the system computes a confidence value behind the scenes. The end result is a suggestion based on the system's best guess -- a diagnosis or, in your case, the best result from among a list of competitors.

    With some work this could be reimplemented in Perl, or you can check CPAN for a Lisp interpreter. I am not aware of a test module on CPAN that utilizes this technique but would be happy to have one myself if it indeed exists.

    Celebrate Intellectual Diversity

Re: Testing: shades of grey
by LanX (Saint) on Dec 23, 2024 at 23:45 UTC
    Here is another approach:

    Repeating the tests according to their importance will create a metric at the end.

    use Test::More;

    sub weight (&$$) {
        my ($c_test, $name, $weight) = @_;
        my $res = subtest $name, $c_test;
        ok( $res, $name ) for 2 .. $weight;
    }

    weight { is "a", "a" } "PASSING" => 3;
    weight { is "a", "X" } "FAILING" => 2;

    done_testing;

    # Subtest: PASSING
        ok 1
        1..1
    ok 1 - PASSING
    ok 2 - PASSING
    ok 3 - PASSING
    # Subtest: FAILING
        not ok 1
        #   Failed test at /home/lanx/perl/pm/test_metric.pl line 16.
        #          got: 'a'
        #     expected: 'X'
        1..1
        # Looks like you failed 1 test of 1.
    not ok 4 - FAILING
    #   Failed test 'FAILING'
    #   at /home/lanx/perl/pm/test_metric.pl line 6.
    not ok 5 - FAILING
    #   Failed test 'FAILING'
    #   at /home/lanx/perl/pm/test_metric.pl line 8.
    1..5
    # Looks like you failed 2 tests of 5.    <--- Metric

    As you see, you are quite flexible with a bit of functional programming.

    Another approach might be using Test::Builder to create your own semantics.

    update
    For instance you can use $Test->no_diag(1); to make the repeated fails less verbose.

    update

    or you can use my @tests = $Test->details; to get a detailed overview which tests passed and if they were todos.

    # test_metric.pl:33: [
    #   { actual_ok => 1, name => "PASSING", ok => 1, reason => "", type => "" },
    #   { actual_ok => 1, name => "PASSING", ok => 1, reason => "", type => "todo" },
    #   { actual_ok => 1, name => "PASSING", ok => 1, reason => "", type => "todo" },
    #   { actual_ok => 0, name => "FAILING", ok => 0, reason => "", type => "" },
    #   { actual_ok => 0, name => "FAILING", ok => 1, reason => "", type => "todo" },
    # ]

    After putting the weight into the name, you can easily calculate your desired metric.

    NB: ->details also works locally for subtests only.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery

      Finally you can also use TAP::Parser to scan the output.

      So if I were you, I'd use the demonstrated abstraction in combination with Test::Builder to write your own appropriate test semantics and put the "weight" into the comments.

      I would then parse the verbose output with TAP::Parser and calculate the desired "metric" from the weights.

      I wouldn't mess with the classic test result output, because it's a different "all or nothing" metric, separate from the "better" metric.

      So better to put non-crucial tests into TODO blocks, because they are not kill criteria.

      Regarding testing things with various levels of "ok-ness", I'd write a test which tries them from best to worst and bails out as soon as one passes, i.e. reporting OK and logging the weight in the comments.

      Like if-elsif-... chains or, in the case of simple strings, hash look-ups with the weight as values.

      A routine like

      isinchain( $got, [ [$exp1, $weight1, $name1], ... ], $name ) could do the abstraction; see the sketch below.
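      A hedged sketch of such a routine; the signature and the weight-comment format are just one possible choice:

      use strict;
      use warnings;
      use Test::More;

      # check $got against expected variants from best to worst;
      # log the weight of the first match as a TAP comment
      sub isinchain {
          my ($got, $a_chain, $name) = @_;
          local $Test::Builder::Level = $Test::Builder::Level + 1;
          for my $link (@$a_chain) {
              my ($exp, $weight, $linkname) = @$link;
              if ( $got eq $exp ) {
                  pass($name);
                  diag( sprintf "weight: %s (%s)", $weight, $linkname );
                  return 1;
              }
          }
          fail($name);
          diag("got '$got', matching no expected variant");
          return 0;
      }

      isinchain( 'a+-b',
          [ ['a-b', 1.0, 'best'], ['a+-b', 0.6, 'ok'], ['a+(-b)', 0.3, 'ugly'] ],
          'expr a-b' );

      done_testing;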

      That's all very abstract, because your description was very abstract too.

      Hope this helps. :)

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      see Wikisyntax for the Monastery