Polyglot has asked for the wisdom of the Perl Monks concerning the following question:
To the real gurus of programming!
I am faced with the special challenge of comparing two different documents in which revisions have been made, where the changes need to be highlighted. It's difficult to know how to go about this. I began by splitting each document into its sentences, and then had to align the sentences manually, since in some cases entire sentences were added or removed. Following this, I began comparing the documents one sentence at a time.
So, assume we have only two sentences to compare. We want only the revisions to be annotated/marked, leaving the parts of the sentences which are still the same unmarked, even if their positioning within the sentence is offset.
Not being able to come up with something better, I have marked only the unique words in each sentence. But this catches only some of the differences.
I did it something like this:
    # SPLIT THE SENTENCES INTO TOKENS FOR INDIVIDUAL COMPARISON
    # ($line1 and $line2 hold the two sentences being compared)
    my $splitter = qr/((?:<[^>]+>)+|(?:\s)+|(?:\w[A-Za-z'-]*\w*)+|(?:\W|\P{IsWord})|(?:\p{IsDigit}))/;
    my $punct    = qr/^(?:[ .:;'"}{\]\[\(\)!\?\*\+\-])+$/;
    my @tokens1  = split($splitter, $line1);
    my @tokens2  = split($splitter, $line2);

    foreach my $token (@tokens1) {
        # Skip empty and punctuation-only tokens before any escaping
        next if $token eq '' || $token =~ $punct;
        # \Q...\E escapes regexp metacharacters in the token.
        # No /g on the test: in scalar context /g keeps pos() on the
        # target string, so a later token could fail to match.
        unless ($line2 =~ m/\Q$token\E/i) {
            $line1 =~ s~\b(\Q$token\E)\b~<span class="m">$1</span>~gi;
        }
    }
    foreach my $token (@tokens2) {
        next if $token eq '' || $token =~ $punct;
        unless ($line1 =~ m/\Q$token\E/i) {
            $line2 =~ s~\b(\Q$token\E)\b~<span class="m">$1</span>~gi;
        }
    }
Here are some samples of the text, noting versions (A) and (B) and how they were marked.
Example 1.
(A) A few moments will suffice to commit it to memory; yet the period which it covers, commencing more than twenty-five centuries ago, reaches on from that far-distant point past the rise and fall of kingdoms, past the setting up and overthrow of empires, past cycles and ages, past our own day, over into the eternal state.
(B) A few moments will suffice to commit it to memory, yet the period which it covers, beginning more than twenty-five centuries ago, reaches from that far-distant point past the rise and fall of kingdoms, past the setting up and overthrow of empires, past cycles and ages, past our own day, to the eternal state.
Example 2.
(A) Now opens one of the sublimest chapters of human history.
(B) Now opens one of the most comprehensive of the histories of world empires.
Example 3.
(A) With what interest, as well as astonishment, must the king have listened, as he was informed by the prophet that he, or rather his kingdom, the king being here put for his kingdom (see the following verse), was the golden head of the magnificent image which he had seen.
(B) With what interest and astonishment must the king have listened as he was informed by the prophet that his kingdom was the golden head of the magnificent image.
In the first example above, only the difference between "commencing" and "beginning", along with the "over into", is marked. These are the only unique words when the two sentences are compared against each other. But the first sentence also has an "on" inserted, and the second has a "to" that replaced the unique words of its counterpart. Those two words ("on" and "to") exist somewhere else in the counterpart sentence, so they are not unique and are not marked.
In the second example above, it would be desirable to have the entire phrase "sublimest chapters of human history" marked as different from the entire phrase "most comprehensive of the histories of world empires." Perhaps it would be a complicating factor that, positionally, the two of's in each of those expressions do line up, making their distinction more difficult to catch. I'd be content if all but that word of those phrases were marked--but it would be nicer to have the whole phrase caught as a unit.
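The limitation shows up even in a stripped-down version of the unique-word idea. The sketch below is a simplification and not the snippet above: it compares lowercased whole words only, and `unique_words` is a name made up for illustration. Run on Example 2, it reports every changed word except the words both sentences share, so "of" and "the" can never be marked and the phrases cannot come out as a unit.

```perl
use strict;
use warnings;

# Return the words of $from that never occur as whole words in $other,
# compared case-insensitively, in order of first appearance.
# (Hypothetical helper, for illustration only.)
sub unique_words {
    my ($from, $other) = @_;
    my @other_words = $other =~ /[\w'-]+/g;
    my @from_words  = $from  =~ /[\w'-]+/g;
    my %in_other    = map { (lc($_) => 1) } @other_words;
    my %seen;
    return grep { !$in_other{$_} && !$seen{$_}++ } map { lc($_) } @from_words;
}

my $sent_a = 'Now opens one of the sublimest chapters of human history.';
my $sent_b = 'Now opens one of the most comprehensive of the histories of world empires.';

my @only_a = unique_words($sent_a, $sent_b);
my @only_b = unique_words($sent_b, $sent_a);

print "only in A: @only_a\n";   # only in A: sublimest chapters human history
print "only in B: @only_b\n";   # only in B: most comprehensive histories world empires
```

Because the test is set membership, word order and position play no part in it, which is why a repeated word such as "of" slips through.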
In the third example, the problem with word alignment becomes more apparent. We have three words "as well as" in (A) and only "and" in (B) at the same position. This means the remainder of the sentences, though still much the same, may now be hard for the parser to compare as they are positionally out of alignment. Note also the comma after "astonishment" in one sentence only.
I'm quite happy if the parser ignores differences in punctuation and capitalization--for my purposes, just words and meanings are the focus. It's okay if such minor differences are marked in some way, but not necessary.
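One possible way around the positional-offset problem is to align the two word lists on their longest common subsequence (LCS) before marking anything. What follows is only a minimal sketch of that idea under stated assumptions, not a tested solution for the full documents: the helper names `words`, `diff_words` and `mark_side` are made up for illustration, punctuation is simply dropped, and words are compared case-insensitively, in line with the tolerance described above. It replaces the unique-word test with a hand-written LCS alignment and does not use Algorithm::NeedlemanWunsch.

```perl
use strict;
use warnings;

# Split a sentence into words; punctuation and whitespace are dropped.
sub words {
    my ($sentence) = @_;
    my @w = $sentence =~ /[\w'-]+/g;
    return @w;
}

# Align two word lists on their longest common subsequence.
# Returns [tag, word_a, word_b] triples: '=' kept, '-' only in A, '+' only in B.
sub diff_words {
    my ($wa, $wb) = @_;                          # array refs of words
    my ($n, $m) = (scalar @$wa, scalar @$wb);
    # $len[$i][$j] = LCS length of words $i.. of A and words $j.. of B
    my @len = map { [ (0) x ($m + 1) ] } 0 .. $n;
    for my $i (reverse 0 .. $n - 1) {
        for my $j (reverse 0 .. $m - 1) {
            if (lc($wa->[$i]) eq lc($wb->[$j])) {
                $len[$i][$j] = $len[$i + 1][$j + 1] + 1;
            } else {
                my ($down, $right) = ($len[$i + 1][$j], $len[$i][$j + 1]);
                $len[$i][$j] = $down >= $right ? $down : $right;
            }
        }
    }
    # Walk the table from the start, emitting one triple per word.
    my ($i, $j) = (0, 0);
    my @pairs;
    while ($i < $n && $j < $m) {
        if (lc($wa->[$i]) eq lc($wb->[$j])) {
            push @pairs, ['=', $wa->[$i++], $wb->[$j++]];
        } elsif ($len[$i + 1][$j] >= $len[$i][$j + 1]) {
            push @pairs, ['-', $wa->[$i++], undef];
        } else {
            push @pairs, ['+', undef, $wb->[$j++]];
        }
    }
    push @pairs, ['-', $wa->[$i++], undef] while $i < $n;
    push @pairs, ['+', undef, $wb->[$j++]] while $j < $m;
    return @pairs;
}

# Rebuild one side ('-' for A, '+' for B) as a string, wrapping each run
# of changed words in the same span markup used in the snippet above.
sub mark_side {
    my ($tag, @pairs) = @_;
    my $col = $tag eq '-' ? 1 : 2;
    my (@out, @run);
    for my $p (@pairs, ['=', undef, undef]) {    # sentinel flushes a trailing run
        my ($t, $w) = ($p->[0], $p->[$col]);
        if ($t eq $tag) { push @run, $w; next; }
        next if $t ne '=';                       # word belongs to the other side
        if (@run) {
            push @out, '<span class="m">' . join(' ', @run) . '</span>';
            @run = ();
        }
        push @out, $w if defined $w;
    }
    return join ' ', @out;
}

my $sent_a = 'Now opens one of the sublimest chapters of human history.';
my $sent_b = 'Now opens one of the most comprehensive of the histories of world empires.';
my @pairs  = diff_words([ words($sent_a) ], [ words($sent_b) ]);

print mark_side('-', @pairs), "\n";
# Now opens one of the <span class="m">sublimest chapters</span> of <span class="m">human history</span>
print mark_side('+', @pairs), "\n";
# Now opens one of the <span class="m">most comprehensive</span> of <span class="m">the histories of world empires</span>
```

On Example 2 this marks everything in the two phrases except one shared "of", and because the alignment is by sequence and not by position, an early substitution such as "as well as" for "and" does not throw off the words that follow. The table needs memory proportional to the product of the two sentence lengths, which should be acceptable when comparing one sentence pair at a time.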
Honestly, I just can't wrap my brain around how this task might be accomplished. I experimented with:
use Algorithm::NeedlemanWunsch;
But I was unable to achieve the results I wanted with it. How would you do this?
Blessings,
~Polyglot~