Tux has asked for the wisdom of the Perl Monks concerning the following question:

At TPC in Glasgow I released App::ccdiff, which - in short - shows horizontal diffs as well as vertical diffs more clearly.

That might look (with all verbosity on) like this:

$ ccdiff -u0m --ascii termc*
5,5c5,5
-
+
+ ^
40,41c40,41
- :Va=\E[0m:Vc=\E[0;33m:Ve=\E[0;4m:Vg=\E[0;4;36m:\
- ^
- :Vi=\E[0;37;41m:Vk=\E[0;1;33;41m:Vo=\E[0;1;36;41;4m:cQ=\E?25I:
- ^ ^ ^ ^
+ :Va=\E[0m:Vc=\E[0;36m:Ve=\E[0;4m:Vg=\E[0;4;36m:\
+ ^
+ :Vi=\E[0;37;44m:Vk=\E[0;1;37;44m:Vo=\E[0;1;36;44;4m:cQ=\E?25I:
+ ^ ^ ^ ^

This works fine for the purpose it is written for: find tiny changes with more ease.

However, it makes no sense if a chunk shows a change from 4 lines to 24 lines with completely different content. In that case you just want to see the chunk as lines-deleted + lines-added, with no markers for the changed characters, as almost every character would be marked.

As I currently see it, there are multiple approaches to falling back from the current behavior to a normal diff report:

It is possible to implement both and allow both at the same time.

  1. Did I state the problem well enough?
  2. Do these options make sense?
  3. Do the defaults make sense?
  4. Do you envision other options (that you would use)?

Before I start coding/changing, I'd like opinions on how you would use it and/or expect it to be used, in order to improve the DWIM behavior.

Enjoy, Have FUN! H.Merijn

Re: When not to use subdiff
by TheloniusMonk (Sexton) on Aug 23, 2018 at 07:48 UTC
    IMO you need to give priority to the niche you are in rather than worry too much about whether the input falls in your niche. The user can always run a separate old-school diff. But your own output should be rigorously predictable in format, so that people can write code to process it. The only way I see to do that is to have a rigid default behaviour first and offer such options as extras. If %change is important in your problem, I would be inclined to add a switch that replaces the normal output with only a statistical analysis, which the user can then consider before choosing the next step. Each such option, not just the default, should also stick to the rule of rigorous predictability, in the interests of those who will process the output.
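To make the "statistics only" idea concrete, here is a minimal sketch of what such a switch could report. It is written in Python with the standard `difflib` module purely for illustration; it is not how ccdiff works, and the function names `chunk_stats` and `format_stats` as well as the output format are assumptions made up for this example.

```python
from difflib import SequenceMatcher


def chunk_stats(old_lines, new_lines):
    """Return one (old_count, new_count, percent_changed) tuple per changed chunk."""
    stats = []
    matcher = SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged lines are not part of any chunk
        old_text = "\n".join(old_lines[i1:i2])
        new_text = "\n".join(new_lines[j1:j2])
        # ratio() is 1.0 for identical text and 0.0 when nothing matches
        similarity = SequenceMatcher(None, old_text, new_text, autojunk=False).ratio()
        stats.append((i2 - i1, j2 - j1, round(100 * (1 - similarity))))
    return stats


def format_stats(stats):
    """Render the statistics in a fixed one-line-per-chunk format (hypothetical)."""
    return [f"chunk {n}: -{old} +{new} changed={pct}%"
            for n, (old, new, pct) in enumerate(stats, 1)]
```

Because every chunk yields exactly one line in the same shape, a script consuming the report can decide per chunk whether a character-level view is worth requesting.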
Re: When not to use subdiff
by dsheroh (Monsignor) on Aug 25, 2018 at 08:42 UTC
    I ran into some surprising output one of the first times I used ccdiff after getting back to work this week:
    - <ds:KeyName></ds:KeyName>
    - ^^^^^^^ ^^^ ^^^^^^^^^
    - <ds:KeyName></ds:KeyName>
    - ^^^^^^^^^^^^^^^^^^^^^^^^
    + <ds:KeyName></ds:KeyName>
    + ^^
    Took me a minute to figure out that it was seeing the diff as "up to 'foo' from the first line, then insert '-t', grab an 'e' and an 's' from later in the first line, and finally take everything starting with 't' on the second line" rather than "insert '-test' on the first line and drop the second line entirely".

    Probably another good case for a "percentage of changed characters is over x%" check.
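A "percentage of changed characters is over x%" check could look roughly like the sketch below. It is Python using the standard `difflib` module, chosen only for illustration; the names `changed_percentage` and `use_char_markers` and the default threshold of 40 are assumptions for this example and say nothing about how ccdiff itself decides.

```python
from difflib import SequenceMatcher


def changed_percentage(old: str, new: str) -> float:
    """Percentage of characters (old plus new) that lie outside matching blocks."""
    total = len(old) + len(new)
    if total == 0:
        return 0.0
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    # each matched character is counted once in old and once in new
    return 100.0 * (total - 2 * matched) / total


def use_char_markers(old: str, new: str, threshold: float = 40.0) -> bool:
    """Keep horizontal markers only while the change stays at or below the threshold.

    The default of 40 percent is an arbitrary placeholder, not a recommendation.
    """
    return changed_percentage(old, new) <= threshold
```

For a small edit such as `Vc=\E[0;33m` versus `Vc=\E[0;36m` only 2 of 22 characters count as changed, so the markers stay; for two unrelated strings the value approaches 100 and the chunk would fall back to plain deleted/added lines.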

      Agree. It stands out way better with -r and colors, but still. I've added the files to my sandbox.

      Note that this is still beyond the scope of what I created it for, but I will not ignore this feedback.

      Enjoy, Have FUN! H.Merijn

      Could you pull from the git repo and try again with -h20? Once you have found your intuitive limit, you can put heuristics : 20 (with your own value) in ~/.config/ccdiff.

      As I got no other suggestions in this thread, I implemented both suggestions.

      Enjoy, Have FUN! H.Merijn
        With -h20 I get:
        - <ds:KeyName></ds:KeyName>
        - <ds:KeyName></ds:KeyName>
        + <ds:KeyName></ds:KeyName>
        So I experimented a bit with other heuristic values, trying to find a setting which would give me
        - <ds:KeyName></ds:KeyName>
        + <ds:KeyName></ds:KeyName>
        + ^^^^^
        - <ds:KeyName></ds:KeyName>
        and found that I get the "classic" diff output for values in the range 2-49, with heuristic values of 1 or 50+ reverting to the original output. Since ccdiff -h describes -h n as "Horizontal char diff treshold" [1], I'm guessing that's because the smallest chunks taken in the original output are 1 character, while the complete line (with the real hostname) is 50 characters. Is that a correct description of how the heuristic works or is it just a coincidence?

        [1] When I pasted that, my spellcheck caught a typo in "treshold" - it's missing an "h".