comment on

This brings up a question: What should an lcss() routine return?

All the longest common substrings?
If not:
The 'first'.
From the left-hand end of the string; or right-hand end; or the first discovered by the most efficient algorithm?
If the longest common substring appears multiple times at different offsets;
Should they
All be returned;
Or only the unique ones?
The first found?

Of course, the only real answer is: it depends upon the application requirements. It's easy to suggest that all is the most flexible solution as it allows the application to choose what it needs; but that doesn't help for applications -- like the one I was addressing when I started this thread 7 years ago -- where the application only needs one; knows that there will only be one; and is time critical enough that it cannot afford to waste time searching for others that either it knows won't exist; or it doesn't care about even if they do.

Above you said "Un-optimized, but functionally correct code is better than broken code."; but that's only true if there is a definitive definition of "functionally correct". For example; the lcss_brute() function you've now uploaded; is at least by one definition, 'incorrect'; as it doesn't look for duplicates.

That can be corrected by adding another loop:

sub lcss_brute {
    my( $r1, $r2, $min ) = @_;
    my( $l1, $l2, $swap ) = ( length $$r1, length $$r2, 0 );
    ( $r1, $r2, $l1, $l2, $swap ) = ( $r2, $r1, $l2, $l1, 1 ) if $l1 >
+ $l2;

    my( $best, @solns ) = 0;
    for my $start ( 0 .. $l2 - 1 ) {
        for my $l ( reverse 1 .. $l1 - $start ) {
            my $substr = substr( $$r1, $start, $l );
            my $o = 0;
            while( $o = 1+index( $$r2, $substr, $o ) ) {
                if( $l > $best ) {
                    $best = length $substr;
                    @solns = [ $substr, $start, $o-1 ];
                }
                elsif( $l == $best ) {
                    push @solns, [ $substr, $start, $o-1 ];
                }
            }
        }
    }
    return \@solns;
}
[download]

Which means that for your latest example it now produces similar output to lcss_all():

[ 7:25:50.33] C:\test>lcss-test.pl
[["gh", 1, 7], ["gh", 1, 9], ["da", 3, 4]]
[download]

Though as you can see, because the algorithm operates in a different way to lcss_all(), it produces the matches in a different order. That could be cured by sorting by position, but then you've added another overhead for those applications that simply do not care.

For example: I finally dug out the CD that contained the code I sent to my client, and I discovered why they have never encountered the false hit problem you found in the OP code. In the application, the needles (shorter strings) are always newline terminated; and the haystacks (longer strings) do not contain embedded newlines, so that particular problem can never occur in their application. So for their purposes; my OP algorithm was and is still "functionally correct".

However, their primary concern was that they are/(were) processing huge volumes of data and the lcss() algorithm they were using was responsible for a high proportion of their processing time. The savings from my version of the algorithm combined with some other savings I found reduced their runtime by almost 50%. Slow and exhaustive would have been of no value to them at all.

On the CD I also found that I had coded up an Inline::C version of the OP algorithm that runs even faster; but that would (almost certainly) exhibit the same 'flaw' as the OP version. Thinking back I recall they rejected that as they did not have the skills to maintain C code. None the less; with the addition of the fix I gave above for the false hits problem, that would probably satisfy the needs of many users with a lcss() requirement.

So the problem is: where to go from here?

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)

In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re^8: A better implementation of LCSS? (What should a Longest Common Substring function return?) by BrowserUk
in thread A better implementation of LCSS? by BrowserUk

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.