Finding a LCS module on word level

st_ale has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Finding a LCS module on word level by Limbic~Region (Chancellor) on May 09, 2008 at 02:17 UTC
st_ale, I adapted the code I wrote Re^2: Longest Common Subsequence to work on words instead of characters (and s/subsequence/substring/). Read more... (2 kB) It works, but I am sure it is the best implementation. I took a solution I wrote for finding the longest common subsequence for more than two strings, adapted it for longest common substring for more than two strings, and then adapted it again to use words than rather than characters. If you want the actual position in the string of the words, you will need to mess with %map. Cheers - L~R	[reply] [d/l]
Re: Finding a LCS module on word level by educated_foo (Vicar) on May 08, 2008 at 23:58 UTC
Have you tried combining Algorithm::LCS with split?	[reply]
Re^2: Finding a LCS module on word level by st_ale (Novice) on May 09, 2008 at 00:08 UTC
Yes, I have tried, but as I mentioned above, like the string1 "perl find perl module" and string2 "perl module", it will return me the "perl" in the front and the "module"in the back, instead of return me the consecutive "perl module" in the back.This is what I do not expect... Thanks.	[reply]
Re^3: Finding a LCS module on word level by educated_foo (Vicar) on May 09, 2008 at 01:30 UTC
My bad -- I was thinking of Algorithm::Diff, which gives you what you want, specifically `LCSidx`.	[reply] [d/l]
Re^4: Finding a LCS module on word level by Limbic~Region (Chancellor) on May 10, 2008 at 00:37 UTC
Re^5: Finding a LCS module on word level by educated_foo (Vicar) on May 10, 2008 at 01:29 UTC
Some notes below your chosen depth have not been shown here
Re^3: Finding a LCS module on word level by Khen1950fx (Canon) on May 09, 2008 at 06:06 UTC
Your question is still a little unclear to me, but this will give you the consecutive "perl module". `#!/usr/bin/perl use strict; use warnings; use String::LCSS_XS qw(lcss); my $longest = lcss ( "perl find perl module", "perl module"); print $longest, "\n";` [download]	[reply] [d/l]
Re^4: Finding a LCS module on word level by Limbic~Region (Chancellor) on May 10, 2008 at 00:33 UTC
Re: Finding a LCS module on word level by ikegami (Patriarch) on May 09, 2008 at 02:59 UTC
So why don't you copy String::LCSS and modify it to work on words instead of characters? The implementation finds the longuest common substring (LCSS) of characters, but the algorithm can just as easily find the LCSS of words. Update: Turns out String::LCSS is buggy and unmaintained. It doesn't work for inputs of the form `lcss("b", "ab");` (like your example) and other inputs. It's apparently using an inefficient algorithm too, according to one of the bug reports (which actually provides an implementation of a better algorithm).	[reply] [d/l]
Re: Finding a LCS module on word level by dragonchild (Archbishop) on May 09, 2008 at 00:53 UTC
I may be missing something, but isn't this kinda like (untested!): `sub lcs { my ($string, $find) = @_; my $l = length $find; while ( $l > 0 ) { my $pos = index( $string, $find ); if ( $pos > -1 ) { return ( $pos, $l ); } $l--; chop $find; } return; }` [download] My criteria for good software: Does it work? Can someone else come in, make a change, and be reasonably certain no bugs were introduced?	[reply] [d/l]
Re^2: Finding a LCS module on word level by ikegami (Patriarch) on May 09, 2008 at 02:52 UTC
It fails for `lcs('axb', 'cxd');`	[reply] [d/l] [select]
Re^3: Finding a LCS module on word level by dragonchild (Archbishop) on May 09, 2008 at 14:30 UTC
Ah, so we have to examine the powerset of substrings and not just the stem. Ok, that's just a change to the loop. I think my point still stands, though. My criteria for good software: Does it work? Can someone else come in, make a change, and be reasonably certain no bugs were introduced?	[reply]
Re^4: Finding a LCS module on word level by Limbic~Region (Chancellor) on May 10, 2008 at 00:41 UTC
Re^5: Finding a LCS module on word level by dragonchild (Archbishop) on May 10, 2008 at 02:58 UTC
Re: Finding a LCS module on word level by thundergnat (Deacon) on May 09, 2008 at 14:17 UTC
Algorithm::Diff seems to do what you are looking for. `use strict; use warnings; use Algorithm::Diff qw/LCSidx/; my $str1 = "I am trying to find a perl LCS module in perl monk"; my $str2 = "perl monk"; print longest( $str1, $str2 ); sub longest { my @seq1 = split /\s+/, $_[0]; my @seq2 = split /\s+/, $_[1]; my ( $idx1, $idx2 ) = LCSidx( \@seq1, \@seq2 ); my @list = @seq1[@$idx1]; return join( ' ', @list ), '[', join( ',', @$idx1 ), ']'; }` [download] Yields: `perl monk[10,11]` [download]	[reply] [d/l] [select]
Re^2: Finding a LCS module on word level by almut (Canon) on May 09, 2008 at 15:16 UTC
Algorithm::Diff finds the longest common subsequence, which is not necessarily the same as the longest common substring. So, I don't think it's what the OP is looking for, even though the result obtained with the given sample strings does look correct... (as in this particular case, the longest common subsequence also is the longest common substring). This is not always true, however. For example, try modifying `$str2` to read "a perl monk". `Algorithm::Diff::LCSidx` will (correctly) identify this longest common subsequence: `a perl monk[5,10,11]` [download] which is not a substring of `$str1`...	[reply] [d/l] [select]
Re: Finding a LCS module on word level by ikegami (Patriarch) on May 10, 2008 at 17:37 UTC
Here's a quick and dirty solution `use String::LCSS_XS qw( lcss ); sub lcssw { my ($s1, $s2) = @_; my $i; my %codes; my %words; for ($s1, $s2) { $_ = join '', map { $codes{$_} = chr(++$i) if !exists($codes{$_}); $codes{$_} } lc($_) =~ /\w+/g; } my $lcss = lcss($s1, $s2); @words{values %codes} = keys %codes; return join ' ', @words{ $lcss =~ /./sg }; }` [download] Untested. Shouldn't be used if the two input strings combined contain more than 254 different words. (The technique can be extended through clever encoding of the number, but you'd be better spending your efforts modifying the module to handle words.) Update: Fixed typo ($lcss_ ⇒ $lcss)	[reply] [d/l]
Re: Finding a LCS module on word level by lima1 (Curate) on May 09, 2008 at 17:01 UTC
See http://en.wikipedia.org/wiki/Longest_common_substring_problem for a dynamic programming algorithm. You will find a perl implementation in this Wikibooks thing. It is char based, but you can easily make it word based by splitting /\s+/ instead of // (don't forget to change $m and $n as well). (This is basically what ikegami already said.)	[reply]
Re^2: Finding a LCS module on word level by Limbic~Region (Chancellor) on May 09, 2008 at 20:05 UTC
lima1, Forget chars vs words for a second. Are you familiar with DP solution for LCS (where S=substring or S=subsequence - I don't care) that works for more than 2 strings? I wrote Longest Common Subsequence because I couldn't find any working examples for > 2 strings but am still interested. Cheers - L~R	[reply]
Re^3: Finding a LCS module on word level by lima1 (Curate) on May 10, 2008 at 10:31 UTC
I am not an expert, but a related problem in bioinformatics is the multiple sequence alignment (which is basically the longest common sequence with sophisticated scoring functions). exact DP solutions are O(n^m) and the implementation gets really ugly for n>2. So people use heuristics. Which one works depends how similar the strings are, how many you want to align etc... A simple and common approach is to cluster the strings in pairs with something like UPGMA, calculate the pair lcs and then calculate the lcs of the pair lcs according some special rules. http://en.wikipedia.org/wiki/Multiple_sequence_alignment. The substring problem is solvable in linear time with suffix arrays/trees. Tree::Suffix claims to have this function.	[reply]