Finding changed words

swiftone has asked for the wisdom of the Perl Monks concerning the following question:

As part of a text-to-html project I'm on, I need to display changes in the text from one version to the next in bold. Happily, I'm storing the text in CVS, so getting a diff is quite simple.

My problem is that the client is very specific: the bolded sections need to be the word, phrase, sentence, or paragraph that is different, but not more. Omissions are unmarked (don't ask why). The CVS diff identifies changed lines between drafts, but I need to pull the changed words out of them. (Note that 'lines' in this case are actually paragraphs.)

My idea so far was to use Algorithm::Diff, which does element-by-element comparisons of two lists. I can split the lines into lists of words and run those through it. My trouble now is figuring out how to translate the result into bolding. This is not helped by the fact that at some point I have to run the line through HTML::Entities::encode_entities(), which will move stuff around and break any bolding put in by a regexp.

Algorithm::Diff will give me output like:

         [
          [ [ '-', 0, 'a' ] ],

          [ [ '+', 2, 'd' ] ],

          [ [ '-', 4, 'h' ] ,
            [ '+', 4, 'f' ] ],

          [ [ '+', 6, 'k' ] ],

          [ [ '-', 8, 'n' ],
            [ '-', 9, 'p' ],
            [ '+', 9, 'r' ],
            [ '+', 10, 's' ],
            [ '+', 11, 't' ],
          ]
        ]

Except that in my case, the letters will be words. Can anyone think of a relatively elegant way to mark changed sections in <B></B> tags, while still working with encode_entities and not getting confused by punctuation?
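One way to read the question, as a minimal sketch rather than a definitive answer: collect the indices of the '+' entries from a hunk list shaped like the output above, then walk the new word list and wrap each run of added words in <B></B>, escaping every word individually so the tags themselves are never encoded. The names bold_added and escape_html are made up for this illustration, and the hand-rolled escape_html is only a stand-in for encode_entities so that the sketch runs on core Perl. The hunk list is hard-coded here; in practice it would come from Algorithm::Diff.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the three characters that matter in HTML text.
# A stand-in for HTML::Entities::encode_entities (assumption, core Perl only).
sub escape_html {
    my ($text) = @_;
    my %entity = ( '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' );
    $text =~ s/([&<>])/$entity{$1}/g;
    return $text;
}

# $hunks: array ref shaped like the diff output shown above.
# $new:   array ref holding the words of the newer version.
# Returns the new text with each run of added words wrapped in <B></B>.
sub bold_added {
    my ($hunks, $new) = @_;

    # Only '+' entries matter: their index points into the new sequence.
    # '-' entries (omissions) are ignored, so they stay unmarked.
    my %added;
    for my $hunk (@$hunks) {
        for my $change (@$hunk) {
            my ($sign, $index) = @$change;
            $added{$index} = 1 if $sign eq '+';
        }
    }

    my @out;
    my $i = 0;
    while ($i < @$new) {
        if ($added{$i}) {
            # Gather consecutive added words into one bold run.
            my @run;
            push @run, escape_html($new->[$i++]) while $i < @$new && $added{$i};
            push @out, '<B>' . join(' ', @run) . '</B>';
        }
        else {
            push @out, escape_html($new->[$i++]);
        }
    }
    return join ' ', @out;
}

# The new sequence that matches the sample hunks above.
my @new   = qw(b c d e f j k l m r s t);
my $hunks = [
    [ [ '-', 0, 'a' ] ],
    [ [ '+', 2, 'd' ] ],
    [ [ '-', 4, 'h' ], [ '+', 4, 'f' ] ],
    [ [ '+', 6, 'k' ] ],
    [ [ '-', 8, 'n' ], [ '-', 9, 'p' ],
      [ '+', 9, 'r' ], [ '+', 10, 's' ], [ '+', 11, 't' ] ],
];

print bold_added($hunks, \@new), "\n";
# prints: b c <B>d</B> e <B>f</B> j <B>k</B> l m <B>r s t</B>
```

Because escaping happens per word before the tags are added, punctuation attached to a word simply travels with it. Splitting on whitespace does mean that "word," and "word" count as different elements.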


Replies are listed 'Best First'.
Re: Finding changed words by tye (Sage) on Sep 16, 2000 at 00:54 UTC
And combining merlyn's and my answers I give you:

    #!/usr/bin/perl -w
    use strict;
    use Algorithm::Diff qw( traverse_sequences );

    exit main();

    {
        my( $outState, %markUp, %htmlEnt );
        BEGIN {
            %markUp= (
                same => ['',''],
                new => ['<strong>','</strong>'],
                old => ['<font size="-1"><strike>','</strike></font>'],
            );
            %htmlEnt= (
                "<"=>"lt", "&"=>"amp", ">"=>"gt",
                "["=>"#91", "]"=>"#93"
            );
        }
        sub output {
            my( $style, $text )= @_;
            if( ! defined $outState ) {
                print $markUp{ $outState= $style }[0];
            } elsif( $style ne $outState ) {
                print $markUp{$outState}[1], $markUp{$style}[0];
                $outState= $style;
            }
            $text =~ s#([][<>&])#&$htmlEnt{$1};#g;
            $text =~ s#\n#<br>\n#g;
            print $text;
        }
    }

    sub flush {
        my( $meth, $rOld, $rNew )= @_;
        return if "" eq $$rOld && "" eq $$rNew;
        if( "" eq $$rOld ) {
            output( "new", $$rNew );
        } elsif( "" eq $$rNew ) {
            output( "old", $$rOld );
        } elsif( ! $meth ) {
            output( "old", $$rOld );
            output( "new", $$rNew );
        } else {
            my( $meth, $old, $new )= &$meth( $rOld, $rNew );
            compare( $meth, $old, $new );
        }
        $$rOld= $$rNew= "";
    }

    sub compare {
        my( $meth, $old, $new )= @_;
        my( $oldTemp, $newTemp )= ( "", "" );
        traverse_sequences( $old, $new, {
            MATCH => sub {
                flush( $meth, \$oldTemp, \$newTemp );
                output( "same", $old->[$_[0]] )
            },
            DISCARD_A => sub { $oldTemp .= $old->[$_[0]] },
            DISCARD_B => sub { $newTemp .= $new->[$_[1]] },
        } );
        flush( $meth, \$oldTemp, \$newTemp );
    }

    sub SentToWord {
        my( $rOld, $rNew )= @_;
        return(
            undef,
            [ split /(?<=\s)(?=\S)/, $$rOld ],
            [ split /(?<=\s)(?=\S)/, $$rNew ],
        );
    }

    sub ParaToSent {
        my( $rOld, $rNew )= @_;
        return(
            \&SentToWord,
            [ split /(?<=[.?!]\s)/, $$rOld ],
            [ split /(?<=[.?!]\s)/, $$rNew ],
        );
    }

    sub main {
        die "Usage: $0 old new >dif\n"
            unless 2 == @ARGV;
        my( $old, $new );
        {
            local($/,@ARGV)= ('',$ARGV[0]);
            $old= [<>];
        }
        {
            local($/,@ARGV)= ('',$ARGV[1]);
            $new= [<>];
        }
        compare( \&ParaToSent, $old, $new );
        output( "same", "" );
        exit 0;
    }

This compares non-HTML files in a way that produces HTML that shows the differences.
Update: I added support for quoting [ and ] and (temporarily) put some sample output from this on my home node. - tye (but my friends call me "Tye")
Re: Finding changed words by merlyn (Sage) on Sep 15, 2000 at 23:00 UTC
See my solution at the snippet "Showing differences between two sequences". -- Randal L. Schwartz, Perl hacker
Re: Finding changed words by tye (Sage) on Sep 15, 2000 at 22:49 UTC
I'd actually convert paragraphs to lines so I could diff those. Then, for each paragraph that was different, convert sentences to lines and diff those. For each of those, convert words to lines and diff those. Sure, it is difficult, but I think that is the fault of the specification. (: - tye (but my friends call me "Tye")
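The splitting step of that paragraph, sentence, word descent can be sketched in isolation. The two regular expressions below are the ones used by ParaToSent and SentToWord in tye's combined script in this thread; the wrapper names to_sentences and to_words and the sample text are made up for illustration. Both patterns split on zero-width or lookbehind boundaries, so the whitespace stays attached to each piece and concatenating the pieces gives back the original text.

```perl
use strict;
use warnings;

# Hypothetical wrapper: split a paragraph after sentence-ending punctuation
# that is followed by one whitespace character.
sub to_sentences {
    my ($paragraph) = @_;
    return split /(?<=[.?!]\s)/, $paragraph;
}

# Hypothetical wrapper: split a sentence at each boundary between whitespace
# and the start of the next word.
sub to_words {
    my ($sentence) = @_;
    return split /(?<=\s)(?=\S)/, $sentence;
}

my $paragraph = "One fish. Two fish! Red fish?";

for my $sentence (to_sentences($paragraph)) {
    # Brackets make the retained trailing whitespace visible.
    print join('', map { "[$_]" } to_words($sentence)), "\n";
}
# prints:
# [One ][fish. ]
# [Two ][fish! ]
# [Red ][fish?]
```

Keeping the whitespace inside the elements is what lets the output side simply print pieces back to back, at the cost of treating "fish. " and "fish " as different elements.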
Re: Finding changed words by extremely (Priest) on Sep 16, 2000 at 01:35 UTC
No one answered half your question so I'll take a stab at it. You need the "bolding" to survive the conversion process to "entities". The number one suggestion would be to run both texts through encode_entities BEFORE comparing them. Then your added HTML is safe. A second suggestion would be to use a special character that can't appear in the body, like \0175 or some really low char like \f, the formfeed char. encode_entities will likely grab that and turn it into something like ý or some such, so that after encode_entities you can regexp that marker into the HTML you want. Use one marker for bold on and one for bold off. -- $you = new YOU; honk() if $you->love(perl)
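The marker idea can be sketched as follows, under some loudly stated assumptions: the control characters \x01 and \x02 never occur in the text, the changed word indices are already known, and the hand-rolled encode_text (a made-up name) stands in for encode_entities and leaves the markers untouched. With the real encode_entities, its optional second argument can restrict which characters are treated as unsafe (for example '<>&'); with the default set the markers may themselves be encoded, in which case the final substitutions would have to match the encoded form instead.

```perl
use strict;
use warnings;

# Assumption: these two control characters never appear in the source text.
my ($BOLD_ON, $BOLD_OFF) = ("\x01", "\x02");

# Hypothetical stand-in for encode_entities that only touches & < >
# and therefore passes the marker characters through unchanged.
sub encode_text {
    my ($text) = @_;
    my %entity = ( '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' );
    $text =~ s/([&<>])/$entity{$1}/g;
    return $text;
}

# $words:   array ref of the words of the new version.
# $changed: hash ref whose keys are the indices of changed words.
sub mark_then_encode {
    my ($words, $changed) = @_;

    # 1. Wrap each changed word in the markers.
    my @marked = map {
        $changed->{$_} ? "$BOLD_ON$words->[$_]$BOLD_OFF" : $words->[$_]
    } 0 .. $#$words;

    # 2. Encode the whole line; the markers survive.
    my $html = encode_text(join ' ', @marked);

    # 3. Merge adjacent bold runs, then swap markers for real tags.
    $html =~ s/$BOLD_OFF $BOLD_ON/ /g;
    $html =~ s/$BOLD_ON/<B>/g;
    $html =~ s/$BOLD_OFF/<\/B>/g;
    return $html;
}

print mark_then_encode([ 'a', '<b>', 'c', 'd' ], { 1 => 1, 2 => 1 }), "\n";
# prints: a <B>&lt;b&gt; c</B> d
```

The first suggestion (encode first, then compare) avoids markers entirely, since tags added after encoding are never seen by the encoder.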
RE: Re: Finding changed words by tye (Sage) on Sep 16, 2000 at 01:47 UTC
FYI, my solution encoded HTML entities by hand during output so the bolding code doesn't get turned into entities. - tye (but my friends call me "Tye")

