This brings up a question: What should an lcss() routine return?

  1. All the longest common substrings?

    If not:

  2. The 'first'.

    From the left-hand end of the string; or right-hand end; or the first discovered by the most efficient algorithm?

  3. If the longest common substring appears multiple times at different offsets;

    Should they

  4. All be returned;
  5. Or only the unique ones?
  6. The first found?

Of course, the only real answer is: it depends upon the application requirements. It's easy to suggest that all is the most flexible solution as it allows the application to choose what it needs; but that doesn't help for applications -- like the one I was addressing when I started this thread 7 years ago -- where the application only needs one; knows that there will only be one; and is time critical enough that it cannot afford to waste time searching for others that either it knows won't exist; or it doesn't care about even if they do.

Above you said "Un-optimized, but functionally correct code is better than broken code."; but that's only true if there is a definitive definition of "functionally correct". For example; the lcss_brute() function you've now uploaded; is at least by one definition, 'incorrect'; as it doesn't look for duplicates.

That can be corrected by adding another loop:

sub lcss_brute { my( $r1, $r2, $min ) = @_; my( $l1, $l2, $swap ) = ( length $$r1, length $$r2, 0 ); ( $r1, $r2, $l1, $l2, $swap ) = ( $r2, $r1, $l2, $l1, 1 ) if $l1 > + $l2; my( $best, @solns ) = 0; for my $start ( 0 .. $l2 - 1 ) { for my $l ( reverse 1 .. $l1 - $start ) { my $substr = substr( $$r1, $start, $l ); my $o = 0; while( $o = 1+index( $$r2, $substr, $o ) ) { if( $l > $best ) { $best = length $substr; @solns = [ $substr, $start, $o-1 ]; } elsif( $l == $best ) { push @solns, [ $substr, $start, $o-1 ]; } } } } return \@solns; }

Which means that for your latest example it now produces similar output to lcss_all():

[ 7:25:50.33] C:\test>lcss-test.pl [["gh", 1, 7], ["gh", 1, 9], ["da", 3, 4]]

Though as you can see, because the algorithm operates in a different way to lcss_all(), it produces the matches in a different order. That could be cured by sorting by position, but then you've added another overhead for those applications that simply do not care.

For example: I finally dug out the CD that contained the code I sent to my client, and I discovered why they have never encountered the false hit problem you found in the OP code. In the application, the needles (shorter strings) are always newline terminated; and the haystacks (longer strings) do not contain embedded newlines, so that particular problem can never occur in their application. So for their purposes; my OP algorithm was and is still "functionally correct".

However, their primary concern was that they are/(were) processing huge volumes of data and the lcss() algorithm they were using was responsible for a high proportion of their processing time. The savings from my version of the algorithm combined with some other savings I found reduced their runtime by almost 50%. Slow and exhaustive would have been of no value to them at all.

On the CD I also found that I had coded up an Inline::C version of the OP algorithm that runs even faster; but that would (almost certainly) exhibit the same 'flaw' as the OP version. Thinking back I recall they rejected that as they did not have the skills to maintain C code. None the less; with the addition of the fix I gave above for the false hits problem, that would probably satisfy the needs of many users with a lcss() requirement.

So the problem is: where to go from here?


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re^8: A better implementation of LCSS? (What should a Longest Common Substring function return?) by BrowserUk
in thread A better implementation of LCSS? by BrowserUk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.