I am working on a program that highlights the changes between two manuscript collections, so I am looking for the fastest algorithm or solution for marking the differences between two strings.
The differences should be marked according to the following rules:
- the difference of the two strings should be word based (not character based)
- new words should be marked between "<" and ">" (e.g. "<new_word>")
- common words that changed their position in the string should be placed between "[" and "]" (e.g. "[changed_place]")
- common words that kept their position in the string should be just copied to output
e.g.
--- original strings ---
Perlmonks is the best perl community
Perlmonks is one of the best community of perl users
--- marked strings ---
Perlmonks is the best [perl] [community]
Perlmonks is <one of> the best [community] <of> [perl] <users>
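
As a rough illustration of the rules above, here is a minimal sketch in Python, used only for illustration since the post itself contains no code. It assumes words are split on whitespace, that Python's standard difflib.SequenceMatcher provides the alignment, and that any unaligned word which also occurs somewhere in the other string counts as "moved". The names mark_pair and _mark are hypothetical helpers, not part of any module.

```python
from difflib import SequenceMatcher
from itertools import groupby


def _mark(words, kept, other_vocab):
    """Render one word list; `kept` holds indices of words aligned in both strings."""
    labels = []
    for i, word in enumerate(words):
        if i in kept:
            labels.append("keep")      # aligned with the other string: copy as-is
        elif word in other_vocab:
            labels.append("moved")     # present in the other string, but not aligned
        else:
            labels.append("new")       # absent from the other string
    out = []
    for label, group in groupby(zip(labels, words), key=lambda pair: pair[0]):
        run = [word for _, word in group]
        if label == "keep":
            out.extend(run)
        elif label == "moved":
            out.extend(f"[{word}]" for word in run)   # one bracket pair per word
        else:
            out.append("<" + " ".join(run) + ">")     # consecutive new words share one pair
    return " ".join(out)


def mark_pair(original, revised):
    """Return both strings marked up; assumption: tokens are whitespace-separated words."""
    a, b = original.split(), revised.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    kept_a, kept_b = set(), set()
    for block in matcher.get_matching_blocks():
        kept_a.update(range(block.a, block.a + block.size))
        kept_b.update(range(block.b, block.b + block.size))
    return _mark(a, kept_a, set(b)), _mark(b, kept_b, set(a))


first, second = mark_pair(
    "Perlmonks is the best perl community",
    "Perlmonks is one of the best community of perl users",
)
print(first)   # Perlmonks is the best perl [community]
print(second)  # Perlmonks is <one of> the best [community] <of> perl <users>
```

Under this alignment only one word of a swapped pair is reported as moved ("perl" stays aligned, "community" is bracketed), so the result differs slightly from the hand-marked example, where both words are bracketed.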
Current approach:
I currently use the LCSS dynamic programming algorithm to find the longest common substring. I then compare the position of the LCSS within the two strings. If the position changed, I mark the substring with "[]"; otherwise I leave it unmarked. I do the same for all remaining common substrings. Whatever is left after all LCSS operations is considered new and is marked with "<>".
The algorithm is very slow, and I have difficulty finding all common substrings between the two strings.
I would appreciate it if you could point me to a different solution, module, algorithm, etc.
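
For reference, the dynamic-programming step described above might look like the following minimal sketch. It is written in Python for illustration only, and it assumes the comparison runs over lists of words rather than characters, which shrinks the table considerably for word-based marking. The name longest_common_run is hypothetical. It keeps only two rows of the table, so memory is proportional to the length of the second string, while time remains proportional to the product of the two word counts.

```python
def longest_common_run(a, b):
    """Return (start_a, start_b, length) of the longest common contiguous run of words.

    On ties, the run that ends earliest in `a` is returned.
    """
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)          # run lengths ending at the previous word of a
    for i, word_a in enumerate(a, 1):
        cur = [0] * (len(b) + 1)       # run lengths ending at the current word of a
        for j, word_b in enumerate(b, 1):
            if word_a == word_b:
                cur[j] = prev[j - 1] + 1          # extend the run ending at (i-1, j-1)
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


a = "Perlmonks is the best perl community".split()
b = "Perlmonks is one of the best community of perl users".split()
print(longest_common_run(a, b))  # (0, 0, 2), i.e. "Perlmonks is"
```

Finding all common runs would mean calling this again on the pieces to the left and right of each match, which is roughly what a diff library does internally.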
Thank you.
In reply to diff of two strings by flaviusm