http://qs1969.pair.com?node_id=249239

rkg has asked for the wisdom of the Perl Monks concerning the following question:

Hi --

I am looking for a module to solve (or appoximately solve) the longest common substring problem.

I found two modules on CPAN: String-LCSS-0.10 and String-Ediff-0.01.

The first didn't download ("File not found") from CPAN; the second required a C compiler and I don't have one available (Activestate perl). Any have any suggestions?

Thanks

R

Replies are listed 'Best First'.
Re: Longest Common Substring
by nothingmuch (Priest) on Apr 09, 2003 at 15:25 UTC
    Algorithm::Diff is probably of use. It's pure perl, and nearly every other module i found relies on it.

    -nuffin
    zz zZ Z Z #!perl
Re: Longest Common Substring
by glwtta (Hermit) on Apr 09, 2003 at 15:36 UTC
    String::LCSS certainly seems to be what you want. And I don't seem to be having any trouble getting it from CPAN.
Re: Longest Common Substring
by BrowserUk (Patriarch) on Apr 09, 2003 at 16:24 UTC

    Probably not the most efficient implementation (and maybe incomplete), but if you can't get the modules to work, you could wrap this up as a subroutine.

    #! perl -slw use strict; use Data::Dumper; my $needle = 'the lazy cat'; my $haystack = 'the quick brown fox jumps over the lazy dog'; my @matches; for my $start (0..length $needle) { for my $len ($start+1 .. length $needle) { my $substr = substr($needle, $start, $len); push @matches, $haystack =~ m[($substr)]g; } } my ($len, $longest) = 0; length > $len and ($longest, $len) = ($_, length) for @matches; print "The longest common substring between\n", $needle, "\nand\n", $h +aystack, "\nis '$longest'"; __END__ C:\test>249239.pl The longest common substring between the lazy cat and the quick brown fox jumps over the lazy dog is 'the lazy ' C:\test>

    Examine what is said, not who speaks.
    1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
    2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
    3) Any sufficiently advanced technology is indistinguishable from magic.
    Arthur C. Clarke.
      BrowserUK's solution doesn't quite work I tried the following and did not get back what was expected ( " the quick brown fox "):
      my $needle = 'i pushed the lazy dog into a creek, the quick brown fox +told me to'; my $haystack = 'the quick brown fox jumps over the lazy dog';
      and got back 'the lazy dog' (it works if you reverse the $needle and the $haystack.

      But to his credit it seems to work better than String-LCSS-0.10, and he did mention that it might be incomplete (run it both ways, get the longest string and print that, I believe should give you the most right answer, but my tests might also be incomplete).

      Where which running the following code:

      I would use Algorithm-Diff using this part of the documentation as a starting point.

      Note: The Author of String-LCSS-0.10 has been notified.

      -enlil

Re: Longest Common Substring
by BrowserUk (Patriarch) on Apr 09, 2003 at 20:13 UTC

    Enlil's right, my code was a bit slovenly. I believe this is both more correct and more efficient, but I can't guarentee it doesn't fail on some edge case or another.

    #! perl -slw use strict; sub lcss (\$\$) { my ($needle, $haystack) = @_; ($needle, $haystack) = ($haystack, $needle) if length $$needle > length $$haystack; my ($longest_c, $longest) = 0; for my $start (0..length $$needle) { for my $len ( reverse $start+1 .. length $$needle) { my $substr = substr($$needle, $start, $len); length $1 > $longest_c and ($longest_c, $longest) = (lengt +h $1, $1) while $$haystack =~ m[($substr)]g; } } return $longest; } my $needle = 'the lazy cat'; my $haystack = 'the quick brown fox jumps over the lazy dog'; print "The longest common substring between\n" , $needle, "\nand\n" , $haystack, "\nis '" , lcss($needle, $haystack) , "'"; $needle = 'i pushed the lazy dog into a creek, the quick brown fox tol +d me to'; $haystack = 'the quick brown fox jumps over the lazy dog'; print "The longest common substring between\n" , $needle, "\nand\n" , $haystack, "\nis '" , lcss($needle, $haystack) , "'"; __END__ C:\test>249239.pl The longest common substring between the lazy cat and the quick brown fox jumps over the lazy dog is 'the lazy ' The longest common substring between i pushed the lazy dog into a creek, the quick brown fox told me to and the quick brown fox jumps over the lazy dog is 'the quick brown fox '

    Examine what is said, not who speaks.
    1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
    2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
    3) Any sufficiently advanced technology is indistinguishable from magic.
    Arthur C. Clarke.
Re: Longest Common Substring
by PodMaster (Abbot) on Apr 10, 2003 at 00:53 UTC
    I don't know if it was you , but somebody has requested String-Ediff and I have put it up on my repository.

    If it was more than a single function it would've been impossible for me to do since the guy who wrote it used swig, and well, swig on win32 hasn't been working out for me. I ended up writing an XS port (pretty easily I might add), and the source can be found at http://crazyinsomniac.perlmonk.org/perl/files/String-Ediff-0.01.tar.gz.


    MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
    I run a Win32 PPM repository for perl 5.6x+5.8x. I take requests.
    ** The Third rule of perl club is a statement of fact: pod is sexy.

LCS != LCSS
by rkg (Hermit) on Apr 11, 2003 at 12:38 UTC
    Hi, all. Thanks for the help. After spending some time with Algorithm::Diff, I've realized that Alg::Diff's LCS(Longest Common SubSequence) is not the same as String-LCSS (Longest Common Substring). Just wanted to share that (perhaps obvious) observation. LCSS does what I need, and is working fine for me...thanks!

    R