flaviusm,

your problem is in principle NP-hard. However, the hardest part of the problem is to determine whether a word is "unmoved" (these are the unmarked ones in my output, see also Re^3: diff of two strings). This can be done in O(nm) with this dynamic programming ansatz. How the words are finally printed out in printDiff based for example on uniqueness (insertion, but only in the second string) or moved position (in both strings, but a insertion/deletion) are IMHO trivial changes.

I do not understand your note and I still think the LCS is the way to go. Do you understand the algorithm? If yes, then please tell my why or when the LCS fails.

If you really want to use common substrings to construct a skeleton, try Tree::Suffix, which is also mentioned in my String::LCSS_XS module. But keep in mind that these modules are character based (so they are much slower without any re-encoding tricks) and that the free suffix tree/array implementations are in general not as good as they could be. Optimizing the lcs algorithm in C is really easy. See the wikipedia article for details. With the linear memory modification, the data should even fit in the cache for not too large strings.


In reply to Re^5: diff of two strings by lima1
in thread diff of two strings by flaviusm

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.