Ah, ok, that makes sense. Actually working with that idea has enabled me to clean the code up a lot. The code should run in O(N) time, with N being the sum of the word counts of the strings. It makes two passes over the strings, once to build the word count hashes and once to do the actual diff.
use strict; use warnings; use Data::Dumper; my $str1 = 'Perlmonks is the best perl community'; my $str2 = 'Perlmonks is one of the best community of perl users'; if ( $str1 eq $str2 ) { print "$str1\n$str2\n"; exit; } my @wl1 = split /\s+/, $str1; my @wl2 = split /\s+/, $str2; my $diff_str1 = ''; my $diff_str2 = ''; my %wc1; my %wc2; foreach my $word (@wl1) { $wc1{$word}++; } foreach my $word (@wl2) { $wc2{$word}++; } while (@wl1 || @wl2) { my $word1 = ''; my $word2 = ''; # being sloppy and not decrementing word counts for new words # since we only use them for moves while (!$wc2{$word1}) { $diff_str1 .= "<$word1> " if $word1; $wc1{$word1}-- if $word1; $word1 = ''; # prevent fall through if this is the last word last if !@wl1; $word1 = shift @wl1; } while (!$wc1{$word2}) { $diff_str2 .= "<$word2> " if $word2; $wc2{$word2}-- if $word2; $word2 = ''; last if !@wl2; $word2 = shift @wl2; } if ( $word1 && $word2 && $word1 eq $word2 ) { $diff_str1 .= $word1 . ' '; # pairing the word from the ori +gional string with it's output $diff_str2 .= $word2 . ' '; # lets us do things like case i +nsensitive, but preserving match later } else { $diff_str1 .= "[$word1] " if $word1; $diff_str2 .= "[$word2] " if $word2; } $wc1{$word2}--; $wc2{$word1}--; } print "$diff_str1\n$diff_str2\n";
output:
Perlmonks is the best [perl] [community] Perlmonks is <one> <of> the best [community] <of> [perl] <users>
The only weird case i found is if a word occurs in one string more than in the other, and they occur in the same position but not first. For example:
$str1 = 'Perlmonks is the best perl perl community'; $str2 = 'Perlmonks is one of the best community of perl users'; --------- Perlmonks is the best [perl] <perl> [community] Perlmonks is <one> <of> the best [community] <of> [perl] <users>
I'm not sure if that's correct or not by your standard. My first algorithm had accounted for this with the position hash, by looking ahead to see if a later occurrence of the word was in the same absolute position in the string. I did away with that since absolute position isn't in fact what you were interested in, but if you need to account for this case differently, something like that could be done, with an offset added to account for how the new words change the positions in the string.

I'm somewhat curious what real world problem you're trying to solve with this. Depending on what you're doing, you might be able to get better results from a different methodology. For example, if you're trying to build a plagiarism detector, you might want to look into some kind of document similarity type algorithm.

In reply to Re^3: diff of two strings by chaos_cat
in thread diff of two strings by flaviusm

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.