in reply to Re^7: A better implementation of LCSS? (Do you have any combinatorics expertise to bring to bear?)
in thread A better implementation of LCSS?

This brings up a question: What should an lcss() routine return?

  1. All the longest common substrings?

    If not:

  2. The 'first'.

    From the left-hand end of the string; or right-hand end; or the first discovered by the most efficient algorithm?

  3. If the longest common substring appears multiple times at different offsets;

    Should they

  4. All be returned;
  5. Or only the unique ones?
  6. The first found?

Of course, the only real answer is: it depends upon the application requirements. It's easy to suggest that all is the most flexible solution as it allows the application to choose what it needs; but that doesn't help for applications -- like the one I was addressing when I started this thread 7 years ago -- where the application only needs one; knows that there will only be one; and is time critical enough that it cannot afford to waste time searching for others that either it knows won't exist; or it doesn't care about even if they do.

Above you said "Un-optimized, but functionally correct code is better than broken code."; but that's only true if there is a definitive definition of "functionally correct". For example; the lcss_brute() function you've now uploaded; is at least by one definition, 'incorrect'; as it doesn't look for duplicates.

That can be corrected by adding another loop:

sub lcss_brute { my( $r1, $r2, $min ) = @_; my( $l1, $l2, $swap ) = ( length $$r1, length $$r2, 0 ); ( $r1, $r2, $l1, $l2, $swap ) = ( $r2, $r1, $l2, $l1, 1 ) if $l1 > + $l2; my( $best, @solns ) = 0; for my $start ( 0 .. $l2 - 1 ) { for my $l ( reverse 1 .. $l1 - $start ) { my $substr = substr( $$r1, $start, $l ); my $o = 0; while( $o = 1+index( $$r2, $substr, $o ) ) { if( $l > $best ) { $best = length $substr; @solns = [ $substr, $start, $o-1 ]; } elsif( $l == $best ) { push @solns, [ $substr, $start, $o-1 ]; } } } } return \@solns; }

Which means that for your latest example it now produces similar output to lcss_all():

[ 7:25:50.33] C:\test>lcss-test.pl [["gh", 1, 7], ["gh", 1, 9], ["da", 3, 4]]

Though as you can see, because the algorithm operates in a different way to lcss_all(), it produces the matches in a different order. That could be cured by sorting by position, but then you've added another overhead for those applications that simply do not care.

For example: I finally dug out the CD that contained the code I sent to my client, and I discovered why they have never encountered the false hit problem you found in the OP code. In the application, the needles (shorter strings) are always newline terminated; and the haystacks (longer strings) do not contain embedded newlines, so that particular problem can never occur in their application. So for their purposes; my OP algorithm was and is still "functionally correct".

However, their primary concern was that they are/(were) processing huge volumes of data and the lcss() algorithm they were using was responsible for a high proportion of their processing time. The savings from my version of the algorithm combined with some other savings I found reduced their runtime by almost 50%. Slow and exhaustive would have been of no value to them at all.

On the CD I also found that I had coded up an Inline::C version of the OP algorithm that runs even faster; but that would (almost certainly) exhibit the same 'flaw' as the OP version. Thinking back I recall they rejected that as they did not have the skills to maintain C code. None the less; with the addition of the fix I gave above for the false hits problem, that would probably satisfy the needs of many users with a lcss() requirement.

So the problem is: where to go from here?


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice.
  • Comment on Re^8: A better implementation of LCSS? (What should a Longest Common Substring function return?)
  • Select or Download Code

Replies are listed 'Best First'.
Re^9: A better implementation of LCSS? (What should a Longest Common Substring function return?)
by toolic (Bishop) on Nov 17, 2015 at 18:08 UTC
    What should an lcss() routine return?
    That is precisely the cause of my aforementioned ice cream headache. I agree that the answer depends on the specific requirements of the end user.
    where to go from here?
    My original goal from about a week ago was to improve the existing CPAN String::LCSS module. There is no dispute that it has a bug; it returns the wrong answer for some reasonable pairs of input strings. For example, given the 2 input strings, "abc" and "abcd", it should return "abc" as the longest common substring. But it returns undef, meaning that it did not find a common substring. That is why I attached a patch for that module. Your brute code fixes that bug and the other undisputed bugs reported against it. This is the simple case where there is only one longest common substring.

    The separate issue arises when there are multiple longest common substrings. I don't know what the behavior should be in this case. However, the module documentation should explicitly state the behavior in this case. All the related CPAN modules lack sufficient detail in their POD.

    I still plan to forge ahead with the arduous process of contacting the author of String::LCSS to either upload something new or grant co-maintainership.

    UPDATE: 2016 jan 1, I was grant co-maintainership, and I uploaded String::LCSS version 1.00 which fixes these bugs.