in reply to A better implementation of LCSS?
my $s1 = 'xxxyyxxy'; my $s2 = 'yyyxyxx'; my( $m, $o1, $o2 ) = lcssN($s1, $s2, 1); print "$m, $o1, $o2\n"; __END__ Prints: yxxy, 4, 4 But, I expect yyx (as String::LCSS_XS produces). yxxy is not a substring of $s2.
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re^2: A better implementation of LCSS? (Yes)
by BrowserUk (Patriarch) on Nov 13, 2015 at 04:41 UTC | |
Yes. You found a bug. A simpler example is 'abcdefg' & 'abcdefga'. What happens is this. To speed up the processing, the code xors the longer input with a string that contain the shorter string replicated until is is longer than the longer string. Ie. if you have 'the quick brown fox' & 'brown', the shorter is replicated and xored with the longer like so:
Then the xored result is scanned looking for contiguous runs of zeros the length of the shorter string. In this case '00000'. In your case and my example above, the process of replicating the shorter string creates false matches:
Which makes it amazing to me that the guys I originally wrote the code for have never come back to me. I'm not sure I even know how to contact them again. The obvious solution is to throw away this 'optimisation' and use another nested loop; at which point the performance gain that was the code's raison detre probably disappears :( A first pass at not throwing away the performance gain is this:
I haven't tested what affect that has on performance. Update:This also works for the example you posted but I haven't convinced myself that it won't fail for other inputs yet:
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
by toolic (Bishop) on Nov 14, 2015 at 03:32 UTC | |
The test data is a collection of all the CPAN test files for the 3 modules (Algorithm::LCSS, String::LCSS, String::LCSS_XS). I added the checks which caused problems for the BrowserUk code, plus a few extras so far. The test script can be used to run these checks on the various LCSS subs, one at a time. I included the 3 variants of BrowserUk's code, Perl code I found on wikipedia, plus this compact regex version: Longest common substring. The wiki, the regex and String::LCSS_XS pass all checks. What I'd really like is a way to generate tests automatically. Generating input strings is straightforward, but generating expect values is tricky without a reference model. I tried to stfw for ready-made groups of input strings and expect values, but found nothing. I may just go ahead and use one of the 3 that have no known bugs yet as a reference model. Here is the lcss.pl script: Read more... (8 kB)
Here is the test data file test.txt: Read more... (2 kB)
| [reply] [d/l] [select] |
by BrowserUk (Patriarch) on Nov 14, 2015 at 09:44 UTC | |
What I'd really like is a way to generate tests automatically I agree 100% that a testcase generator is the only sure way to test optimisations for this type of algorithm. However, I think I would approach that goal from a somewhat different angle. Firstly, I believe that it is relatively easy to code (if not so easy to prove) an exhaustive (thus slow) algorithm that is guaranteed to inspect all possibilities; and by extension find all 'best' solutions. Here is (what I believe to be) such an implementation:
Supplied with the bug you found, it produces both valid solutions::
That, if others agree it can neither miss solutions nor generate false ones, takes care of the "reference model". Then comes the problem of ensuring you generate a set of tests that is guaranteed (or at least very likely) to explore all the possible failure modes for optimised implementations. Initially, that seems like a 'how long is a piece of string' (substring:) problem; but thinking back to the little reading I've done on DFA/NFA theory, in particular the observation that for algorithmic proof purposes, single letters are often substituted ("without loss of generality") for unique substrings -- eg. 'aba' is equivalent to both 'the quick the' & 'onetwoone' etc. -- then I believe that it is possible to exercise all possible failure modes by iterating all the permutations of a relatively limited set of characters to produce the 'first string' inputs; and then iterating all the variations of all the substrings of that first string to produce the 'second string' inputs. Please note that my use of the terms 'permutations' and 'variations' in the above is in their common understanding meanings rather than their specific meanings in the combinatorics sense. That is to say, I have yet worked out whether the above should use permutations() or combinations() or variations() or *_with_repetitions() for the two parts of the generation. To this end, I've come up with this as a first pass at an exhaustive testcase generator:
Running that against all the permutation of 'a'..'e' as the first input; and all the variations with repetition of all lengths of substring of the generated first strings, runs in around a minute. I'm not sure yet, but that may be enough? I'm not convinced whether I should be using permutations or combinations; and maybe it could be just variations rather than variations with repetitions; and I'm not yet convinced whether 5 characters is enough; but may be others have some inputs at this point? An interesting exercise would be to run this against the known failing implementations and see if it detects problems; and if those problems it detects can be logically reduced to be the same as the known problems. Anyway, that's as far as my logic will allow me to go at this point; let me know what you (or anyone else still watching) think. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
by toolic (Bishop) on Nov 15, 2015 at 14:08 UTC | |
by BrowserUk (Patriarch) on Nov 16, 2015 at 16:20 UTC | |
| |