Comparing two text files and marking differences

Polyglot has asked for the wisdom of the Perl Monks concerning the following question:

To the real gurus of programming!

I am faced with a special challenge: comparing two versions of a document where revisions have been made and the changes need to be highlighted. It's difficult to know how to go about this. I began by splitting each document into its sentences, and then had to align the lines manually, since in some cases entire sentences were added or removed. After that, I began comparing one sentence at a time.

So, assume we have only two sentences to compare. We want only the revisions to be annotated/marked, leaving the parts of the sentences which are still the same unmarked, even if their positioning within the sentence is offset.

Not being able to come up with something better, I have marked only the unique words in each sentence. But it only catches some of the differences.

I did it something like this:

    #SPLIT THE SENTENCES INTO TOKENS FOR INDIVIDUAL COMPARISON
    @tokens1 = split(/((?:<[^>]+>)+|(?:\s)+|(?:\w[A-Za-z'-]*\w*)+|(?:\W|\P{IsWord})|(?:\p{IsDigit}))/, $line1);
    @tokens2 = split(/((?:<[^>]+>)+|(?:\s)+|(?:\w[A-Za-z'-]*\w*)+|(?:\W|\P{IsWord})|(?:\p{IsDigit}))/, $line2);

    foreach $token (@tokens1) {
        #ESCAPE CHARS TO AVOID REGEXP ISSUES IN SUBSTITUTION
        $token =~ s/([][}{)\(\?.\+\*])/\\$1/g;
        if (($token ne '') && ($token !~ /^(?:[ .:;'"}{\]\[\(\)!\?\*\+\-])+$/)) {
            unless ($line2 =~ m/$token/gi) {
                $line1 =~ s~\b($token)\b~<span class="m">$1</span>~gi;
            }
        }
    }
    foreach $token (@tokens2) {
        $token =~ s/([][}{)\(\?.\+\*])/\\$1/g;
        if (($token ne '') && ($token !~ /^(?:[ .:;'"}{\]\[\(\)!\?\*\+\-])+$/)) {
            unless ($line1 =~ m/$token/gi) {
                $line2 =~ s~\b($token)\b~<span class="m">$1</span>~gi;
            }
        }
    }

Here are some samples of the text, noting versions (A) and (B) and how they were marked.

Example 1.

(A) A few moments will suffice to commit it to memory; yet the period which it covers, commencing more than twenty-five centuries ago, reaches on from that far-distant point past the rise and fall of kingdoms, past the setting up and overthrow of empires, past cycles and ages, past our own day, over into the eternal state.

(B) A few moments will suffice to commit it to memory, yet the period which it covers, beginning more than twenty-five centuries ago, reaches from that far-distant point past the rise and fall of kingdoms, past the setting up and overthrow of empires, past cycles and ages, past our own day, to the eternal state.

Example 2.

(A) Now opens one of the sublimest chapters of human history.

(B) Now opens one of the most comprehensive of the histories of world empires.

Example 3.

(A) With what interest, as well as astonishment, must the king have listened, as he was informed by the prophet that he, or rather his kingdom, the king being here put for his kingdom (see the following verse), was the golden head of the magnificent image which he had seen.

(B) With what interest and astonishment must the king have listened as he was informed by the prophet that his kingdom was the golden head of the magnificent image.

In the first example above, only the differences between "beginning" and "commencing", followed by the "over into", are noted. These are the only unique words when the two sentences are compared against each other. But the first sentence also has an "on" inserted, and the second has a "to" that replaced the unique words of its counterpart. Those two words ("on" and "to") exist somewhere else in the counterpart sentence, so they are not unique and are not marked.
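To illustrate why a position-aware comparison catches what the unique-word test misses, here is a minimal sketch of a word-level diff built on a longest-common-subsequence table. It uses only core Perl; the name word_diff and the [op, word] output shape are assumptions made for this illustration, not part of any module mentioned in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: word-level diff via a longest-common-subsequence table.
# Returns a list of [op, word] pairs, where op is '=' (in both),
# '-' (only in the first sentence) or '+' (only in the second).
sub word_diff {
    my ($from, $to) = @_;
    my @a = split ' ', $from;
    my @b = split ' ', $to;

    # $lcs[$i][$j] = length of the LCS of @a[$i..$#a] and @b[$j..$#b]
    my @lcs = map { [ (0) x (@b + 1) ] } 0 .. @a;
    for my $i (reverse 0 .. $#a) {
        for my $j (reverse 0 .. $#b) {
            if ($a[$i] eq $b[$j]) {
                $lcs[$i][$j] = $lcs[$i + 1][$j + 1] + 1;
            }
            else {
                my ($down, $right) = ($lcs[$i + 1][$j], $lcs[$i][$j + 1]);
                $lcs[$i][$j] = $down >= $right ? $down : $right;
            }
        }
    }

    # Walk the table from the start, emitting matches and differences in order.
    my ($i, $j, @ops) = (0, 0);
    while ($i < @a && $j < @b) {
        if    ($a[$i] eq $b[$j])                     { push @ops, ['=', $a[$i]]; $i++; $j++ }
        elsif ($lcs[$i + 1][$j] >= $lcs[$i][$j + 1]) { push @ops, ['-', $a[$i++]] }
        else                                         { push @ops, ['+', $b[$j++]] }
    }
    push @ops, ['-', $a[$i++]] while $i < @a;
    push @ops, ['+', $b[$j++]] while $j < @b;
    return @ops;
}

my @ops = word_diff(
    'reaches on from that point, over into the eternal state.',
    'reaches from that point, to the eternal state.',
);
print join(' ', map { $_->[0] eq '=' ? $_->[1] : "[$_->[0]$_->[1]]" } @ops), "\n";
```

Because the comparison follows word order rather than set membership, the "on" and "to" are reported as changes even though both words also occur elsewhere in the other sentence. The table costs time and memory proportional to the product of the two sentence lengths, which should be acceptable for sentence-sized input.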

In the second example above, it would be desirable to have the entire phrase "sublimest chapters of human history" marked as different from the entire phrase "most comprehensive of the histories of world empires." Perhaps it would be a complicating factor that, positionally, the two of's in each of those expressions do line up, making their distinction more difficult to catch. I'd be content if all but that word of those phrases were marked--but it would be nicer to have the whole phrase caught as a unit.
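One possible way to get the whole phrase marked as a unit is a post-processing pass over the diff result: a short run of common words that is surrounded by changes on both sides is treated as changed too. The sketch below assumes the diff has already been reduced to a list of [op, word] pairs ('=' common, '-' only in A, '+' only in B); that shape, the name absorb_short_matches and the gap parameter are all assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical post-processing step. A run of at most $max_gap common words
# sitting between two changes is re-tagged as changed on both sides, so the
# surrounding phrase can be marked as one unit.
sub absorb_short_matches {
    my ($max_gap, @ops) = @_;
    my @out;
    my $i = 0;
    while ($i < @ops) {
        # Changed words pass through untouched.
        if ($ops[$i][0] ne '=') { push @out, $ops[$i++]; next }

        # Find the end of this run of common words.
        my $end = $i;
        $end++ while $end < $#ops && $ops[$end + 1][0] eq '=';
        my $len        = $end - $i + 1;
        my $in_between = $i > 0 && $end < $#ops;    # changes on both sides

        for my $k ($i .. $end) {
            if ($in_between && $len <= $max_gap) {
                push @out, ['-', $ops[$k][1]], ['+', $ops[$k][1]];
            }
            else {
                push @out, $ops[$k];
            }
        }
        $i = $end + 1;
    }
    return @out;
}

my @ops = (
    ['=', 'the'], ['-', 'sublimest'], ['+', 'most'],
    ['=', 'of'],  ['-', 'human'],     ['+', 'world'],
);
print join(' ', map { "$_->[0]$_->[1]" } absorb_short_matches(1, @ops)), "\n";
```

With a gap of 1, the lone "of" between two changes is marked on both sides, while the leading "the" stays unmarked because nothing changed before it. A larger gap marks longer phrases at the cost of hiding small genuine matches.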

In the third example, the problem with word alignment becomes more apparent. We have three words "as well as" in (A) and only "and" in (B) at the same position. This means the remainder of the sentences, though still much the same, may now be hard for the parser to compare as they are positionally out of alignment. Note also the comma after "astonishment" in one sentence only.

I'm quite happy if the parser ignores differences in punctuation and capitalization--for my purposes, just words and meanings are the focus. It's okay if such minor differences are marked in some way, but not necessary.
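Ignoring punctuation and capitalization can be handled by comparing a normalized form of each token while still printing the original text. A minimal sketch, with the helper name compare_key being an assumption for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a token to the form used for comparison only.
sub compare_key {
    my ($token) = @_;
    my $key = lc $token;
    $key =~ s/[^\w'-]+//g;    # drop punctuation; keep word characters, ' and -
    return $key;
}

print compare_key('Astonishment,') eq compare_key('astonishment')
    ? "same\n"
    : "different\n";
```

Algorithm::Diff's traverse_sequences accepts an optional key-generation function as its fourth argument, so a function like this could be passed there to make "memory;" and "memory," compare as equal.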

Honestly, I just can't wrap my brain around how this task might be accomplished. I experimented with:

use Algorithm::NeedlemanWunsch;

But was unable to achieve the results I wanted. How would you do this?

Blessings,

~Polyglot~


Replies are listed 'Best First'.
Re: Comparing two text files and marking differences by afoken (Chancellor) on Jan 30, 2021 at 15:26 UTC
diff, Algorithm::Diff, Text::Diff, String::Diff

Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: Comparing two text files and marking differences by tybalt89 (Monsignor) on Jan 30, 2021 at 17:34 UTC
Here's your examples run through a word diff'r by color I had laying around. Maybe this could be a starting point for solving your problem.

    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11127688
    use warnings;
    use Algorithm::Diff qw(traverse_sequences);
    use Term::ANSIColor;

    while( <DATA> )
      {
      my @from = split /(\s+)/;
      my @to = split /(\s+)/, <DATA>;
      traverse_sequences( \@from, \@to,
        {
        MATCH => sub {print $from[shift()]},
        DISCARD_A => sub {print color('red'), $from[shift()], color 'reset'},
        DISCARD_B => sub {print color('green'), $to[pop()], color 'reset'},
        } );
      print "\n\n";
      }

    __DATA__
    A few moments will suffice to commit it to memory; yet the period which it covers, commencing more than twenty-five centuries ago, reaches on from that far-distant point past the rise and fall of kingdoms, past the setting up and overthrow of empires, past cycles and ages, past our own day, over into the eternal state.
    A few moments will suffice to commit it to memory, yet the period which it covers, beginning more than twenty-five centuries ago, reaches from that far-distant point past the rise and fall of kingdoms, past the setting up and overthrow of empires, past cycles and ages, past our own day, to the eternal state.
    Now opens one of the sublimest chapters of human history.
    Now opens one of the most comprehensive of the histories of world empires.
    With what interest, as well as astonishment, must the king have listened, as he was informed by the prophet that he, or rather his kingdom, the king being here put for his kingdom (see the following verse), was the golden head of the magnificent image which he had seen.
    With what interest and astonishment must the king have listened as he was informed by the prophet that his kingdom was the golden head of the magnificent image.
Re^2: Comparing two text files and marking differences by Polyglot (Chaplain) on Jan 31, 2021 at 07:20 UTC
I appreciate all of the answers, and found that I was able to make some modifications to this one in particular which seems to be yielding most of what I want. I'm still doing a few post-subroutine substitutions to clear up some text formatting issues, but the following subroutine does the bulk of what needed to be done.

    sub comparator {

    my $str1 = shift @_;
    my $str2 = shift @_;
    my $original = '';
    my $revised = '';

    my @from = split(/((?:<[^>]+>)+|(?:\s)+|(?:\w[A-Za-z'-]*\w*)+|(?:\W|\P{IsWord})|(?:\p{IsDigit}))/, $str1);
    my @to = split(/((?:<[^>]+>)+|(?:\s)+|(?:\w[A-Za-z'-]*\w*)+|(?:\W|\P{IsWord})|(?:\p{IsDigit}))/, $str2);

    my $OS = qq|<span class="m">|;
    my $OE = qq|</span> |;
    my $RS = qq|<span class="hl">|;
    my $RE = qq|</span> |;

    traverse_sequences( \@from, \@to,
    {
    MATCH => sub { my $oldtext = $from[shift()]; $original .= $oldtext; $revised .= $oldtext },
    DISCARD_A => sub { my $oldtext = $from[shift()]; if ($oldtext =~ m/(?:\p{IsPunct})|(?:\s)/) {$original .= $oldtext } else { $original .= $OS.$oldtext.$OE } },
    DISCARD_B => sub { my $newtext = $to[pop()]; if ($newtext =~ m/(?:\p{IsPunct})|(?:\s)/) {$revised .= $newtext } else { $revised .= $RS.$newtext.$RE } },
    } );

    return ($original, $revised);

    } #END SUB comparator

I have never found the output of a standard diff to be very enlightening. I'm sure it works well to change files, patch-style, but it isn't very readable for someone simply wanting to see what happened to the text in a side-by-side format. This procedure is making a visual inspection much easier, with the help of some HTML markup.

Thank you!

Blessings,

~Polyglot~
Re^3: Comparing two text files and marking differences by afoken (Chancellor) on Jan 31, 2021 at 14:35 UTC
"I have never found the output of a standard diff to be very enlightening. I'm sure it works well to change files, patch-style, but it isn't very readable for someone simply wanting to see what happened to the text in a side-by-side format."

Plain old diff (in the GNU version) has at least four output formats:

Normal (the default; its change commands resemble an ed script):

    /tmp>diff foo bar
    1,2c1,2
    < Bla bla. Foo bar baz.
    < Nada nada nada. Nada?
    ---
    > Bla bar. Foo bar baz.
    > Nada na-da nada. Nada?
    4c4
    < bar. Bla. Bar bla.
    ---
    > bar. Bla bar bla.

Unified:

    /tmp>diff -u foo bar
    --- foo 2021-01-31 15:13:16.892239748 +0100
    +++ bar 2021-01-31 15:13:43.403869518 +0100
    @@ -1,6 +1,6 @@
    -Bla bla. Foo bar baz.
    -Nada nada nada. Nada?
    +Bla bar. Foo bar baz.
    +Nada na-da nada. Nada?
     Foo foo foo! Bar. Foo
    -bar. Bla. Bar bla.
    +bar. Bla bar bla.
     Foo bla bla nada bar.

Side by side (also available via sdiff):

    /tmp>diff -y foo bar
    Bla bla. Foo bar baz.           | Bla bar. Foo bar baz.
    Nada nada nada. Nada?           | Nada na-da nada. Nada?
    Foo foo foo! Bar. Foo             Foo foo foo! Bar. Foo
    bar. Bla. Bar bla.              | bar. Bla bar bla.
    Foo bla bla nada bar.             Foo bla bla nada bar.

rcs:

    /tmp>diff -n foo bar
    d1 2
    a2 2
    Bla bar. Foo bar baz.
    Nada na-da nada. Nada?
    d4 1
    a4 1
    bar. Bla bar bla.

TortoiseSVN comes with a diff and merge tool called TortoiseMerge that can show changes side by side, highlighting not only changed lines, but also changes within the lines.

Side note:

    sub comparator {

    my $str1 = shift @_;
    #...
    my $RE = qq|</span> |;

    traverse_sequences( \@from, \@to,
    {
    # ...
    } );

    return ($original, $revised);

    } #END SUB comparator

Proper indenting would make the "#END SUB comparator" redundant:

    sub comparator {
        my $str1 = shift @_;
        #...
        my $RE = qq|</span> |;

        traverse_sequences( \@from, \@to,
            {
                # ...
            }
        );

        return ($original, $revised);
    }

Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re^4: Comparing two text files and marking differences by Polyglot (Chaplain) on Jan 31, 2021 at 15:53 UTC
Re: Comparing two text files and marking differences by BillKSmith (Monsignor) on Jan 31, 2021 at 04:05 UTC
Although your task sounds simple, it is impossible to implement perfectly. Given any example, you know exactly what you want. Specifying how to handle that case is not too hard, but real-world examples do not always fall neatly into classes. You have not even considered the problem of text being rearranged. What if it is rearranged and then edited? At what point do you consider a sentence a replacement rather than an edit? I recommend that you 'audition' several of the modules that have been suggested. Choose the one that seems the best. Accept the fact that you will not always like its output.

Bill
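The replacement-versus-edit question can at least be made explicit with a similarity score and a cutoff. The following is only a sketch of one such measure, a Dice coefficient over lowercased words; the name word_overlap is an assumption for illustration, and any threshold applied to its result would have to be chosen by inspecting real sentence pairs.

```perl
use strict;
use warnings;

# Hypothetical measure: Dice coefficient over lowercased words, from 0 to 1.
# Word order is ignored; repeated words are counted as often as they occur.
sub word_overlap {
    my ($from, $to) = @_;
    my @a = map { lc $_ } $from =~ /\w+/g;
    my @b = map { lc $_ } $to   =~ /\w+/g;
    return 0 unless @a && @b;

    my %count;
    $count{$_}++ for @a;

    my $shared = 0;
    for my $word (@b) {
        next unless $count{$word};
        $count{$word}--;
        $shared++;
    }
    return 2 * $shared / (@a + @b);
}

printf "%.2f\n", word_overlap(
    'Now opens one of the sublimest chapters of human history.',
    'Now opens one of the most comprehensive of the histories of world empires.',
);
```

Sentence pairs scoring below the chosen cutoff could be shown as a wholesale replacement instead of being diffed word by word. Because the measure ignores word order, it would also still recognize rearranged text as related.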
Re: Comparing two text files and marking differences by stevieb (Canon) on Jan 30, 2021 at 16:44 UTC
Why reinvent the wheel? That's one hell of a lot of work when there's diff, which is specifically for this task, and has been used around the world reliably since almost forever, and it does a heck of a lot more than just highlighting.
Re: Comparing two text files and marking differences by roho (Bishop) on Jan 31, 2021 at 06:31 UTC
If matching for differences does not have to be done in Perl (tmtowtdi), and if you have the Vim editor, you can display the differences between up to four files. You may want to add the following "if" statement in your "_vimrc" file to make the differences stand out better.

    if &diff
        colorscheme clarity
        set diffopt=filler,context:0   "context displays 0 lines before and after lines flagged as different.
    endif

Start the Vim editor as follows: (Note that on Windows, the "start" command frees up the command window while editing)

    start /b gvim -d file1 file2

Once the diff screens are displayed, use the Vim normal mode command "zr" to display all hidden lines in both files.

Save the following text fragments to "file1" and "file2" for testing.

file1:

    A few moments will suffice to commit it to memory; yet the period which it covers, commencing more than twenty-five centuries ago, reaches on from that far-distant point past the rise and fall of kingdoms, past the setting up and overthrow of empires, past cycles and ages, past our own day, over into the eternal state.

file2:

    A few moments will suffice to commit it to memory, yet the period which it covers, beginning more than twenty-five centuries ago, reaches from that far-distant point past the rise and fall of kingdoms, past the setting up and overthrow of empires, past cycles and ages, past our own day, to the eternal state.

"It's not how hard you work, it's how much you get done."
Re: Comparing two text files and marking differences by jcb (Parson) on Jan 31, 2021 at 03:20 UTC
If you have an ancestral version of the document, prior to both sets of changes, the diff3 tool is likely to be useful.
Re: Comparing two text files and marking differences by bliako (Abbot) on Jan 31, 2021 at 10:13 UTC
I am ignorant of version control systems, but isn't what you want to achieve similar? A computer programmer makes a change to the code and the vcs records that and graphically shows the history of all edits, even by different users. Probably the algorithms for doing this are in some free library and you can call them from your scripts if you don't want to force the users to use a specific environment.

Creating a specific environment in which edits are made is another approach if your setup permits: you provide the editor environment within which these changes are made by the reviewer. Because all the changes are made within this editor, it will know what changed, when, by whom and to what, marking these appropriately.

The vcs does not necessarily need fresh texts. If you have the progression of a text, T0->T1->T2, then it can probably still work, as it does not, I think, track user keystrokes but diffs the text as you want.

bw, bliako
Re^2: Comparing two text files and marking differences by Polyglot (Chaplain) on Jan 31, 2021 at 15:40 UTC
Well, my comparison is of two separate books. The original was published in 1897, and then it was dramatically edited and altered in 1944--not by the original author. The changes have not been widely publicized, but they are rather extensive, and entirely change the meaning at times (sometimes fully opposite the original meaning). My goal is simply to apprise people of the changes in a way that they can visualize them more easily and grasp their significance.

One of the books, coming from OCR, seems to have more OCR-related artifacts than the other, and I may end up consulting with a hard-copy version of that book which I happen to have in hand to fix some of those. Seeing the discrepancies will help me locate them more quickly myself.

I'm unfamiliar with software versioning systems as well, having never used them.

Blessings,

~Polyglot~