st_ale has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I am new here and I am trying to find a module providing the Longest Common Substring on a word level. for example,
str1 = "I am trying to find a perl LCS module in perl monk"; str2 = "perl monk"
the return would be "perl monk[10,11]" Can anybody help?

I have tried Algorithm::LCS, but it gives the result of Longest Common Subsequence but not Substring.As the example above, by using Algorithm::LCS, the result would be "perl monk[6,11]", which is not what I want... And I have also tried String::LCSS, but that is on a character level. So can anybody help me find the right module.

Thanks in advance.

Replies are listed 'Best First'.
Re: Finding a LCS module on word level
by Limbic~Region (Chancellor) on May 09, 2008 at 02:17 UTC
    st_ale,
    I adapted the code I wrote Re^2: Longest Common Subsequence to work on words instead of characters (and s/subsequence/substring/). It works, but I am sure it is the best implementation. I took a solution I wrote for finding the longest common subsequence for more than two strings, adapted it for longest common substring for more than two strings, and then adapted it again to use words than rather than characters. If you want the actual position in the string of the words, you will need to mess with %map.

    Cheers - L~R

Re: Finding a LCS module on word level
by educated_foo (Vicar) on May 08, 2008 at 23:58 UTC
      Yes, I have tried, but as I mentioned above, like the string1 "perl find perl module" and string2 "perl module", it will return me the "perl" in the front and the "module"in the back, instead of return me the consecutive "perl module" in the back.This is what I do not expect... Thanks.
        My bad -- I was thinking of Algorithm::Diff, which gives you what you want, specifically LCSidx.
        Your question is still a little unclear to me, but this will give you the consecutive "perl module".

        #!/usr/bin/perl use strict; use warnings; use String::LCSS_XS qw(lcss); my $longest = lcss ( "perl find perl module", "perl module"); print $longest, "\n";
Re: Finding a LCS module on word level
by ikegami (Patriarch) on May 09, 2008 at 02:59 UTC

    So why don't you copy String::LCSS and modify it to work on words instead of characters? The implementation finds the longuest common substring (LCSS) of characters, but the algorithm can just as easily find the LCSS of words.

    Update: Turns out String::LCSS is buggy and unmaintained. It doesn't work for inputs of the form lcss("b", "ab"); (like your example) and other inputs. It's apparently using an inefficient algorithm too, according to one of the bug reports (which actually provides an implementation of a better algorithm).

Re: Finding a LCS module on word level
by dragonchild (Archbishop) on May 09, 2008 at 00:53 UTC
    I may be missing something, but isn't this kinda like (untested!):
    sub lcs { my ($string, $find) = @_; my $l = length $find; while ( $l > 0 ) { my $pos = index( $string, $find ); if ( $pos > -1 ) { return ( $pos, $l ); } $l--; chop $find; } return; }

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
      It fails for lcs('axb', 'cxd');
        Ah, so we have to examine the powerset of substrings and not just the stem. Ok, that's just a change to the loop. I think my point still stands, though.

        My criteria for good software:
        1. Does it work?
        2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: Finding a LCS module on word level
by thundergnat (Deacon) on May 09, 2008 at 14:17 UTC

    Algorithm::Diff seems to do what you are looking for.

    use strict; use warnings; use Algorithm::Diff qw/LCSidx/; my $str1 = "I am trying to find a perl LCS module in perl monk"; my $str2 = "perl monk"; print longest( $str1, $str2 ); sub longest { my @seq1 = split /\s+/, $_[0]; my @seq2 = split /\s+/, $_[1]; my ( $idx1, $idx2 ) = LCSidx( \@seq1, \@seq2 ); my @list = @seq1[@$idx1]; return join( ' ', @list ), '[', join( ',', @$idx1 ), ']'; }

    Yields:

    perl monk[10,11]

      Algorithm::Diff finds the longest common subsequence, which is not necessarily the same as the longest common substring. So, I don't think it's what the OP is looking for, even though the result obtained with the given sample strings does look correct... (as in this particular case, the longest common subsequence also is the longest common substring). This is not always true, however. For example, try modifying $str2 to read "a perl monk". Algorithm::Diff::LCSidx will (correctly) identify this longest common subsequence:

      a perl monk[5,10,11]

      which is not a substring of $str1...

Re: Finding a LCS module on word level
by ikegami (Patriarch) on May 10, 2008 at 17:37 UTC
    Here's a quick and dirty solution
    use String::LCSS_XS qw( lcss ); sub lcssw { my ($s1, $s2) = @_; my $i; my %codes; my %words; for ($s1, $s2) { $_ = join '', map { $codes{$_} = chr(++$i) if !exists($codes{$_}); $codes{$_} } lc($_) =~ /\w+/g; } my $lcss = lcss($s1, $s2); @words{values %codes} = keys %codes; return join ' ', @words{ $lcss =~ /./sg }; }

    Untested. Shouldn't be used if the two input strings combined contain more than 254 different words. (The technique can be extended through clever encoding of the number, but you'd be better spending your efforts modifying the module to handle words.)

    Update: Fixed typo ($lcss_ ⇒ $lcss)

Re: Finding a LCS module on word level
by lima1 (Curate) on May 09, 2008 at 17:01 UTC
    See http://en.wikipedia.org/wiki/Longest_common_substring_problem for a dynamic programming algorithm. You will find a perl implementation in this Wikibooks thing. It is char based, but you can easily make it word based by splitting /\s+/ instead of // (don't forget to change $m and $n as well).

    (This is basically what ikegami already said.)

      lima1,
      Forget chars vs words for a second. Are you familiar with DP solution for LCS (where S=substring or S=subsequence - I don't care) that works for more than 2 strings? I wrote Longest Common Subsequence because I couldn't find any working examples for > 2 strings but am still interested.

      Cheers - L~R

        I am not an expert, but a related problem in bioinformatics is the multiple sequence alignment (which is basically the longest common sequence with sophisticated scoring functions). exact DP solutions are O(n^m) and the implementation gets really ugly for n>2. So people use heuristics. Which one works depends how similar the strings are, how many you want to align etc... A simple and common approach is to cluster the strings in pairs with something like UPGMA, calculate the pair lcs and then calculate the lcs of the pair lcs according some special rules. http://en.wikipedia.org/wiki/Multiple_sequence_alignment.

        The substring problem is solvable in linear time with suffix arrays/trees. Tree::Suffix claims to have this function.