in reply to diff of two strings

Interesting problem, I had fun playing with this one. Thanks for posting it. I'll say in advance, my results do not match the specified results 100%, specifically my code marks the words "the", "best", and "perl" as moved, since their absolute positions in the two strings is different. If you could elaborate on why those words should not be marked as moved, I can adjust the algorithm when I get home this evening.
use strict; use warnings; use Data::Dumper; my $str1 = 'Perlmonks is the best perl community'; my $str2 = 'Perlmonks is one of the best community of perl users'; if ($str1 eq $str2) { print $str1; exit; } my @wl1 = split /\s+/, $str1; my @wl2 = split /\s+/, $str2; my $wp1 = build_word_hash(\@wl1); my $wp2 = build_word_hash(\@wl2); my $diff_str1 = ''; my $diff_str2 = ''; while (@wl1 || @wl2) { my $word1 = shift @wl1; my $word2 = shift @wl2; if ($word1 && $word2 && $word1 eq $word2) { $diff_str1 .= $word1 . ' '; # pairing the word from the origio +nal string with it's output $diff_str2 .= $word2 . ' '; # lets us do things like case inse +nsitive, but preserving match later shift @{$wp1->{$word1}}; # eat this word shift @{$wp2->{$word2}}; # eat this word next; } #process word1 first, for fun if ($word1) { if ($wp2->{$word1} && @{$wp2->{$word1}} && ! grep {$_ == $wp2->{$word1}->[0]} @{$wp1->{$word1}} ) +{ # word moved. # the grep checks that the next occurance of the word in s +tring 2 ($wp2->{$word}->[0] # does not also have an occurance of the word in string 1. # if it does not, it means that this is a move of the word +. $diff_str1 .= "[$word1] "; shift @{$wp2->{$word1}}; # eat this word } else { # Easy case, word in string 1 but not string 2 $diff_str1 .= "<$word1> "; } } if ($word2) { if ($wp1->{$word2} && @{$wp1->{$word2}} && ! grep {$_ == $wp1 +->{$word2}->[0]} @{$wp2->{$word2}} ) { $diff_str2 .= "[$word2] "; shift @{$wp1->{$word2}}; # eat this word } else { $diff_str2 .= "<$word2> "; } } } print "$diff_str1\n$diff_str2\n"; sub build_word_hash { my $wl = shift; my $res = {}; my $i = 0; foreach my $word( @$wl ) { push @{$res->{$word}} , $i++; } return $res; }
results:
Perlmonks is [the] [best] [perl] [community] Perlmonks is <one> <of> [the] [best] [community] <of> [perl] <users>
Sorry for sloppy code, I was writeing this on my lunch break.

Replies are listed 'Best First'.
Re^2: diff of two strings
by flaviusm (Acolyte) on Jan 08, 2008 at 20:14 UTC

    chaos_cat,

    Thank you very much for taking time to "play" with my problem and also for the patience with my incomplete requirements.

    Some clarifications:
    - the new words will not modify the position in the string of the words that follow (this is true for the original string as well as for the modified string).

    e.g.
    ----

    original: Perlmonks is ..................... the best [perl] ...................[community]
    modified: Perlmonks is <one> <of> the best [community] <of> [perl] <users>

    Note: "the" is the 3rd word in the original and it is considered to be the 3rd word in the modified string also because "<one>" and "<of>" are new words and should be ignored (will not increase the position value)

    while

    original: ..... the best [perl] ....................[community]
    modified: ... the best [community] <of> [perl] <users>

    Note: "perl" and "community" should be marked as changed/moved, because they changed the position in the sentence.

    words ranks:
    ------------
    original: ......1.......2....................3....4.....[5].............[6]...........
    modified: ....1.......2..<7>..<8>...3....4.....[6]...<9>...[5]......<10>.

    Please let me know if you don't understant what I tried to explain above and I will explain it in different words.

    Thanks a lot.

      Ah, ok, that makes sense. Actually working with that idea has enabled me to clean the code up a lot. The code should run in O(N) time, with N being the sum of the word counts of the strings. It makes two passes over the strings, once to build the word count hashes and once to do the actual diff.
      use strict; use warnings; use Data::Dumper; my $str1 = 'Perlmonks is the best perl community'; my $str2 = 'Perlmonks is one of the best community of perl users'; if ( $str1 eq $str2 ) { print "$str1\n$str2\n"; exit; } my @wl1 = split /\s+/, $str1; my @wl2 = split /\s+/, $str2; my $diff_str1 = ''; my $diff_str2 = ''; my %wc1; my %wc2; foreach my $word (@wl1) { $wc1{$word}++; } foreach my $word (@wl2) { $wc2{$word}++; } while (@wl1 || @wl2) { my $word1 = ''; my $word2 = ''; # being sloppy and not decrementing word counts for new words # since we only use them for moves while (!$wc2{$word1}) { $diff_str1 .= "<$word1> " if $word1; $wc1{$word1}-- if $word1; $word1 = ''; # prevent fall through if this is the last word last if !@wl1; $word1 = shift @wl1; } while (!$wc1{$word2}) { $diff_str2 .= "<$word2> " if $word2; $wc2{$word2}-- if $word2; $word2 = ''; last if !@wl2; $word2 = shift @wl2; } if ( $word1 && $word2 && $word1 eq $word2 ) { $diff_str1 .= $word1 . ' '; # pairing the word from the ori +gional string with it's output $diff_str2 .= $word2 . ' '; # lets us do things like case i +nsensitive, but preserving match later } else { $diff_str1 .= "[$word1] " if $word1; $diff_str2 .= "[$word2] " if $word2; } $wc1{$word2}--; $wc2{$word1}--; } print "$diff_str1\n$diff_str2\n";
      output:
      Perlmonks is the best [perl] [community] Perlmonks is <one> <of> the best [community] <of> [perl] <users>
      The only weird case i found is if a word occurs in one string more than in the other, and they occur in the same position but not first. For example:
      $str1 = 'Perlmonks is the best perl perl community'; $str2 = 'Perlmonks is one of the best community of perl users'; --------- Perlmonks is the best [perl] <perl> [community] Perlmonks is <one> <of> the best [community] <of> [perl] <users>
      I'm not sure if that's correct or not by your standard. My first algorithm had accounted for this with the position hash, by looking ahead to see if a later occurrence of the word was in the same absolute position in the string. I did away with that since absolute position isn't in fact what you were interested in, but if you need to account for this case differently, something like that could be done, with an offset added to account for how the new words change the positions in the string.

      I'm somewhat curious what real world problem you're trying to solve with this. Depending on what you're doing, you might be able to get better results from a different methodology. For example, if you're trying to build a plagiarism detector, you might want to look into some kind of document similarity type algorithm.

        Excelent! I will try it later today on a big corpus of documents to see if I can spot any exceptions. I will come back with feed-back.

        multiple occurence:
        If the sentence contains two or more consecutive identical words, it doesn't matter which one is marked "new" and which one is marked "moved".

        real world problem:
        This program is meant to help detect the change in semantic of a corpus of similar documents.
        Since word order and new words are the first candidates for a semantic modification I need such a program to detect them and put them in paralel.

      original: Perlmonks is ...... the best perl community modified: Perlmonks is one of the best .... community of perl users
      this is exactly what the LCS does. Everything else are trivial output/postprocessing rules. Not?