LCS efficiency problem

zhe has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: LCS efficiency problem by ikegami (Patriarch) on Jun 05, 2008 at 06:34 UTC
Pardon my abruptness in advance. There's a lot of ground to cover. There's an off-by-one error. The code on which you based your implementation assumed @S and @T were indexed 1..m and 1..n respectively. There's three bugs in `$S[$i] =~ m/^$T[$j]$/ig` alone. The 'g' modifier in scalar context without looping == big red warning light. It's a bug here. That expression doesn't do a case-insensitive comparison. You need `\z` instead of `$` for that. `$T[$j]` contains text, not a regexp. You need to convert it to a regexp. What you get is `$S[$i] =~ m/^\Q$T[$j]\E\z/i`. Better yet, make all the strings lowercase and use `eq`. `$S[$i] ne undef` doesn't check if `$S[$i]` is defined. `defined($S[$i])` does. It seems weird that you should have to check for that at all. I'm inclined to believe you have a bug if you need this check. For that matter, having to check for `$S[$i] ne " "` and `$T[$j] ne " "` means something is wrong too. `" "` isn't a word, so it shouldn't be in @S and @T. That leads us to the next point... `split(/\s+/, $s)` is surely better than `split(/ /, $s)`. Even then, it's not very good. I think I'd favour something more like `$s =~ /\w+/g` which wouldn't include punctuation at the end of words. (Actually, it wouldn't include punctuation at all.) `@prod`'s not needed. Since you have the start and the end of the substring, you can use a slice (`@S[$start..$end]`). It seems that you forgot to clean out `@ret = ();`. How come you're not using `use strict;`??? It seems that you forgot to use `my` on `$num` and `@L`. How come you're not using `use strict;`??? You use the same variable (`$n`) for two different things. That's a bad idea. This one is mostly stylistic, but `my $num = scalar @start; for (my $n = 0; $n < $num;$n++)` is much more readable and maintainable when written as `for my $n (0 .. $#start)` That's a really weird return value. But given your need for speed, it might actually make sense. It might help to return a reference to `@LCS` instead of `@LCS` itself. Those are bugs to fix, not performance enhancements (except the tip to make everything lowercase). I didn't even try running the program, Update: Changed `@s[$start .. $end-$start+1]` to `@S[$start..$end]` Update: Added off-by-one error.	[reply] [d/l] [select]
Re^2: LCS efficiency problem by BrowserUk (Patriarch) on Jun 05, 2008 at 06:44 UTC
split(/\s+/, $s) is surely better than split(/ /, $s). Why? Are you forgetting that `split / /, $s` is magical? Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l]
Re^3: LCS efficiency problem by ikegami (Patriarch) on Jun 05, 2008 at 07:46 UTC
`split / /` isn't magical. You're thinking of `split ' '`. `>perl -le"@a = split / /, 'abc def'; print 0+@a" 3 >perl -le"@a = split ' ', 'abc def'; print 0+@a" 2` [download] `split ' '` is slightly better than `split /\s+/`, but like I said, I wouldn't use it anyway.	[reply] [d/l] [select]
Re^2: LCS efficiency problem by lima1 (Curate) on Jun 06, 2008 at 07:13 UTC
Starting with 1 is part of the dynamic programming algorithm, no bug. `my @S = (undef, map { lc } split(/\w+/, $s));` [download] should fix the wordlists (so that eq is enough). Further filtering as suggested below is a good idea, do that!	[reply] [d/l]
Re^3: LCS efficiency problem by ikegami (Patriarch) on Jun 06, 2008 at 18:31 UTC
Oops, missed the undef. But note that starting at one has nothing to do with the dynamic programming technique. If you wanted to start at zero (say if @S and @T are inputs to the function), then just replace `$L[$i-1][$j-1] \|\|= 0; $L[$i][$j] = $L[$i-1][$j-1] + 1;` [download] with `if ($i && $j) { $L[$i-1][$j-1] \|\|= 0; $L[$i][$j] = $L[$i-1][$j-1] + 1; } else { $L[$i][$j] = 1; }` [download]	[reply] [d/l] [select]
Re: LCS efficiency problem by BrowserUk (Patriarch) on Jun 06, 2008 at 06:43 UTC
How long is it taking now? And what is your target time? The best way of speeding this up, is to use a better algorithm. Naively, discovering the longest common word sequence between two sentences "a b c d" & "p q r" (with letters representing arbitrary words and assuming that a minimum LCWS of 2 words is required), means that you test if "p q r" is found in "a b c d", if not, then "p q", then "q r". If you search the longer string for word subsequences of the shorter, if the shorter string is 3 words, you need to do 3 searches. 4 words, 6 searches. 5 words, 10. Ie. nCr where n is the number of words in the shorter sentence and r=2. So by the time your shorter sentence has 10 words you need 45 string searches, and by 20, 190 searches. But many of these subsequences of the shorter string simply cannot exist in the longer string, because they contain individual words that do not exist there. And vice versa. You are comparing against words in the longer string that do not exist in the shorter. So, given sentences (again letters represent words), "A B C D E F G H I" & "P Q C D E R S A B", then you would be comparing each of these 36 subsequences from the first string against the second: `[A B C D E F G H I][A B C D E F G H][A B C D E F G] [A B C D E F][A B C D E][A B C D][A B C][A B] [B C D E F G H I][B C D E F G H][B C D E F G] [B C D E F][B C D E][B C D][B C] [C D E F G H I][C D E F G H][C D E F G] [C D E F][C D E][C D] [D E F G H I][D E F G H][D E F G][D E F][D E] [E F G H I][E F G H][E F G][E F] [F G H I][F G H][F G] [G H I][G H] [H I]` [download] You can save a considerable number of these searches by breaking each of the string into contiguous fragments that contain only words that exist in the other string, and then performing the LCWS on those fragments. Eg. Fragmenting "A B C D E F G H I" to substring containing only words present in "P Q C D E R S A B", gives "A B", "C D E". Fragmenting "P Q C D E R S A B" to substring containing only words present in "A B C D E F G H I", gives "C D E" & "A B". The work required to find the LCWS "C D E" is considerably less. Realising that such pat examples are not too convincing, here's a more realistic example. Two sanitised sentences drawn from Huckleberry Finn: "when the place couldn't hold no more the duke he quit tending door and went around the back way and come on to the stage and stood up before the curtain and made a little speech and praised up this tragedy and said it was the most thrillingest one that ever was and so he went on abragging about the tragedy and about edmund kean the elder which was to play the main principal part in it and at last when hed got everybodys expectations up high enough he rolled up the curtain and the next minute the king come aprancing out on all fours naked and he was painted all over ringstreakedand striped all sorts of colors as splendid as a rainbow" 123 words. "well when the place couldnt hold no more people the duke he give a fellow a quarter and told him to tend door for him a minute and then he started around for the stage door i after him but the minute we turned the corner and was in the dark he says" 53 words. The shorter of these two sentences contains 53 words. (Don'tcha just love Mark Twain's run-on sentences :). So, the naive approach would require searching the longer sentences 123 words, for all 1378 subsequences of the shorter. Naive solution work factor As a rough measure of the effort/time involved in the above, if we multiply the number of words being searched 123, by the number of words in each of the 1378 subsequences, and sum the results, we get a 'work factor' of: 12353 + 123522 + 123513 + 123504 ... + 123351 + 123252 = 26182 123 = 3,220,386. But, if you fragment those two sentences as above, you get the following two sets of fragments: Sentence A fragments: `[ when the place couldnt hold no more the duke he ] [ door and ] [ around the ] [ to the stage and ] [ was the ] [ was and ] [ was to ] [ and the ] [ minute the ] [ and he was ]` [download] `Sentence B fragments: [ when the place couldnt hold no more ] [ the duke he ] [ a minute and ] [ the stage door ] [ the minute ] [ and was in the ]` [download] Better algorithm work factor If you now compare each of the A fragments against each of the B fragments, and perform similar polynomial calculations for each pairing: `10: [ 6 3 3 3 2 4 ] = 890 2: [ 6 3 3 3 2 4 ] = 246 2: [ 6 3 3 3 2 4 ] = 246 4: [ 6 3 3 3 2 4 ] = 330 2: [ 6 3 3 3 2 4 ] = 246 2: [ 6 3 3 3 2 4 ] = 246 2: [ 6 3 3 3 2 4 ] = 246 2: [ 6 3 3 3 2 4 ] = 246 2: [ 6 3 3 3 2 4 ] = 246 3: [ 6 3 3 3 2 4 ] = 276 =3218` [download] Giving a total work factor of 3218. Or approx. 0.01% of the work. Not totally accurate obviously as it doesn't take into account the effort of fragmenting the strings, but a considerable potential savings anyway. Futher savings And it doesn't stop there. If you sort the fragments in each set, longest first, then you are most likely to discover the LCWS in the early sets and can skip testing when any fragment is shorter than the best you found so far. Worked example The upshot of this is that I ran my algorithm against the 2533 sentences extracted from the Project Gutenberg version of "The Adventures of Huckleberry Finn by Mark Twain" and it found the 189,983 2-word or more LCWSs that result from the ~~5.4 billion~~ 3,206,778 pairings of those sentences in just a few seconds under 10 minutes. Note:The math above is unchecked and may contain errors. The timings and results are correct. Here is a small sample (40 longest) of the LCWSs found: sentence_a_no / sentence_b_no [ longest common word sequence between t +hem ] ( 1354 / 1364 )=>[ when the place couldnt hold no more ] ( 992 / 993 )=>[ and underneath the picture it said ] ( 2037 / 2343 )=>[ it dont make no difference whether ] ( 1237 / 1242 )=>[ with the tears running down their ] ( 534 / 1115 )=>[ i made up my mind i wouldnt ever ] ( 587 / 591 )=>[ looking at me pretty curious and ] ( 347 / 1206 )=>[ towards the middle of the river ] ( 347 / 1365 )=>[ towards the middle of the river ] ( 1206 / 1365 )=>[ towards the middle of the river ] ( 1184 / 1185 )=>[ you aint the only person thats ] ( 19 / 2140 )=>[ i couldnt see no advantage in ] ( 107 / 2065 )=>[ i couldnt make out how he was ] ( 654 / 689 )=>[ to keep from getting run over ] ( 786 / 787 )=>[ aint it natural and right for ] ( 1271 / 1351 )=>[ continental theatres in their ] ( 1586 / 1626 )=>[ the king and the duke come up ] ( 2185 / 2445 )=>[ we went down the lightningrod ] ( 3 / 612 )=>[ i couldnt stand it no longer ] ( 352 / 1155 )=>[ got further and further away ] ( 407 / 521 )=>[ paddled over to the illinois ] ( 829 / 1131 )=>[ it was a monstrous big river ] ( 1468 / 1623 )=>[ around each others necks and ] ( 2099 / 2523 )=>[ with a torchlight procession ] ( 104 / 408 )=>[ i went out in the woods and ] ( 347 / 935 )=>[ a quarter of a mile or more ] ( 393 / 1980 )=>[ wanted to know all about it ] ( 998 / 1237 )=>[ with the tears running down ] ( 998 / 1242 )=>[ with the tears running down ] ( 998 / 1470 )=>[ with the tears running down ] ( 1050 / 1134 )=>[ on tother side of the river ] ( 1237 / 1470 )=>[ with the tears running down ] ( 1242 / 1470 )=>[ with the tears running down ] ( 19 / 104 )=>[ i couldnt see no advantage ] ( 42 / 1613 )=>[ and stretched his neck out ] ( 104 / 2140 )=>[ i couldnt see no advantage ] ( 213 / 521 )=>[ over to the illinois shore ] ( 345 / 739 )=>[ up shore in the easy water ] ( 361 / 1311 )=>[ places on the ground where ] ( 812 / 1136 )=>[ couldnt tell nothing about ] ( 1013 / 1804 )=>[ there in the middle of the ] ( 1652 / 2119 )=>[ aint had no experience and ] [download] Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l] [select]
Re^2: LCS efficiency problem by ikegami (Patriarch) on Jun 07, 2008 at 08:42 UTC
I don't understand your numbers. LCSS is O(NM), so Time for lcss comparing the two complete sentences = 12353 = 6,519 Time for lcss comparing the powerset of segments = (10+2+2+4+2+2+2+2+2+3) * (7+3+3+3+2+4) = 31*22 = 682	[reply]
Re^3: LCS efficiency problem by BrowserUk (Patriarch) on Jun 07, 2008 at 17:14 UTC
When comparing the sentence, "P Q R" (letters represent words) against "A B C D", you need to test the powerset of the short against the longer. Eg. Does "A B C D" contain "P Q R"? Does "A B C D" contain "P Q"? Does "A B C D" contain "Q R"? That gives 43 + 42 + 42 = 28 (not 43=12). Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply]
Re^4: LCS efficiency problem by ikegami (Patriarch) on Jun 07, 2008 at 23:18 UTC
Re^5: LCS efficiency problem by BrowserUk (Patriarch) on Jun 07, 2008 at 23:37 UTC
Some notes below your chosen depth have not been shown here
Re^2: LCS efficiency problem by zhe (Initiate) on Jun 10, 2008 at 07:04 UTC
Hi, BrowserUK, Thanks very much for your reply, I am very interested in understanding it.But I am confused with the part when you fragment those two sentences with the result: Sentence A fragments: when the place couldnt hold no more the duke he door and around the to the stage and was the was and was to and the minute the and he was download Sentence B fragments: when the place couldnt hold no more the duke he a minute and the stage door the minute and was in the could you provide more detailed information on how to fragment the two sentences? Isn't that process also by doing LCS ? Thanks very much	[reply]
Re^3: LCS efficiency problem by BrowserUk (Patriarch) on Jun 10, 2008 at 18:51 UTC
Certainly. I've included my test code at the bottom of this post. And I'll attempt a brief explanation of it. The input to my test code is a file of preprocessed sentences, one per line, with all extraneous characters removed. It reads all the lines into memory, breaking them into arrays of words: `my @sentences = map { [ split ] } <>;` [download] It then builds a parallel array of hashes of uniq words in each sentence: `my @uniq; $uniq[ $_ ] = { map{ $_ => 1} @{ $sentences[ $_ ] } } for 0 .. $#sente +nces;` [download] This is done up-front, as each sentence, and associated hash of uniqs is re-used many times during the run of the code. The main body of the program consists of two nested loops over the indices of the sentences (and associated uniqs hashes), comparing each sentence against every other sentence in the usual way: `my( @aFragments, @bFragments ); for my $ai ( 0 .. $#sentences ) { my $sa = $sentences[ $ai ]; my $ua = $uniq[ $ai ]; for my $bi ( $ai+1 .. $#sentences ) { my $sb = $sentences[ $bi ]; my $ub = $uniq[ $bi ]; ## process sentence[ $ai ] against sentences[ $bi ] } }` [download] The variables `$sa $sb` & `$ua $ub` are references to the sentence arrays and uniq hashes for A & B respectively, and just simplify expressions in the subsequent processing. The variables `@aFragments` & `@bFragments` are produced by calling `fragmentSentence()` passing the sentence arrayref of one sentence and the uniq hashref for the other, for each pairing. `sub fragmentSentence { my( $sa, $ub ) = @_; return reduce{ exists $ub->{ $b } ? push @{ $a->[ $#$a ] }, $b : push @{ $a }, [] ; $a; } [[]], @{ $sa }; }` [download] Works by using reduce to build an array of arrays of contiguous words in sentence A that also appear in sentence B. And vice versa. The central core of the processing then runs a standard LCS algorithm on each of the resultant fragments of sentence A against each of the fragments from sentence B, but taking the opportunity of an early exit at several points when it is obvious that no longer common sequence can be found than has already been seen: ## if there are no common words between the sentences, exit ea +rly next unless first{ exists $ua->{ $_ } } keys %{ $ub }; ## fragment both sentences, discarding one-word fragments ## and sorting them by number of words, longest first. @aFragments = sort{ @$b <=> @$a } grep{ @$_ > 1 } @{ fragmentSentence( $sa, $ub ) }; @bFragments = sort{ @$b <=> @$a } grep{ @$_ > 1 } @{ fragmentSentence( $sb, $ua ) }; ## Exit early if either sentence has no remaining fragments next unless @aFragments and @bFragments; my @best; ## For each A fragment for my $aFrag ( @aFragments ) { ## early exit if this fragment is shorter than the best so + far next if @$aFrag <= @best; ## for each B fragment for my $bFrag ( @bFragments ) { ## Ditto early exit next if @$bFrag <= @best; ## Finally, perform the LCS algorithm my @lcws = lcws( $aFrag, $bFrag ); ## And save if its the longest yet seen @best = @lcws if @lcws > @best } } ## Skip if we didn't find one next unless @best; ## Output the sentence numbers and the best LCWS we found. printf "( %4d / %4d )=>[ %s ]\n", $ai, $bi, join( ' ', @best ) +; [download] Let me know if anything needs clarifying, and also, if it helps your problem. The code: Read more... (3 kB) Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l] [select]
Re: LCS efficiency problem by Limbic~Region (Chancellor) on Jun 05, 2008 at 22:24 UTC
zhe, Are you looking for the longest common substring (word level) across 4000 sentences or for all possible pairings of sentences You might be interested in Finding a LCS module on word level. Unfortunately, without knowing which task you are trying to accomplish, I can't give any more advice. Cheers - L~R	[reply]
Re: LCS efficiency problem by graff (Chancellor) on Jun 06, 2008 at 22:53 UTC
I'm not sure how the following would compare to the OP in terms of approach or overall efficiency, but it was an interesting exercise, and I tried it on the sample data posted above by BrowserUK, with the single answer LCS word string returned in very little time (consistently less than 0.02 sec on a mac intel core duo, 667 MHz). The version below includes its own small test set, but also accepts two or more file names as command line args, and spits out the single (first) LCS string found for each pairing of input files. It will report an "LCS" of a single "word" whenever that happens to be the longest common string, and it will report "NO_MATCH" when two inputs have nothing at all in common. #!/usr/bin/perl use strict; use warnings; sub lcs_arrays { my ( $a1, $a2 ) = @_; my @matches = (); my $longest = 0; for my $b1 ( 0 .. $#$a1 ) { for my $b2 ( 0 .. $#$a2 ) { if ( $$a1[$b1] eq $$a2[$b2] ) { my @match = ( $$a1[$b1] ); my $e1 = $b1+1; my $e2 = $b2+1; while ( $e1 < @$a1 and $e2 < @$a2 ) { last if ( $$a1[$e1] ne $$a2[$e2] ); push @match, $$a1[$e1]; $e1++; $e2++; } if ( @match > $longest ) { push @matches, \@match; $longest = @match; } } } } return $matches[$#matches]; # array ref -> longest matching list } my @t = ( "now is the time for all good men to come to the aid of", "now is the winter of our discontent made glorious", "the time for all good parties now is the winter", "we expect good men to come to the aid of good parties", "I KNOW THIS SENTENCE WILL NOT MATCH", ); if ( @ARGV > 1 and -f $ARGV[0] ) { local $/; @t = (); for my $f ( @ARGV ) { open( I, $f ); push @t, <>; close I; } } for my $i ( 0 .. $#t-1 ) { for my $j ( $i+1 .. $#t ) { my @a = split " ", $t[$i]; my @b = split " ", $t[$j]; my $r = lcs_arrays( \@a, \@b ); if ( defined $r ) { print "$i-$j: @$r\n"; } else { print "$i=$j: NO_MATCHES\n"; } } } [download] Note that the "lcs_arrays" sub assumes that its inputs have already been arranged into arrays according to whatever tokenization strategy is appropriate to a given task. (This means you could use it for character-based comparisons as well as word-based, but there are already other modules available for doing that sort of work.) I think separating the tokenization from the LCS algorithm is a useful thing. In this example, the "main" part of my test script simply splits on whitespace, but you might want to do that part differently, e.g. removing punctuation characters (brackets, commas, periods, quotes, etc) from the left and right edges of each word, folding case, and/or other stuff like that. UPDATE: Why aren't you using the "LCS" function of Algorithm::Diff? It works on arrays, just like the toy function I've posted here, so it's just a matter of how you populate the arrays. Of course, having just played with that a bit, I see now that I might be confused about the "proper" definition of the term "longest common substring". Given two input lists: `list1: one two three five six seven list2: two three five eight five six seven` [download] Algorithm::Diff::LCS will return "two three five six seven", whereas my toy function above will return just "two three five". I presume you know what you mean by "LCS", but you should be careful of what other people might mean by it (esp. if they, like me, might be confused about what the "proper" definition should be).	[reply] [d/l] [select]
Re^2: LCS efficiency problem by ikegami (Patriarch) on Jun 07, 2008 at 09:01 UTC
You're doing lots of work twice. Specifically, the indexes over which your `while` iterates. Your solution is O(NM²) while it could be O(NM) like the OP's. `sub lcs_arrays { my ( $a1, $a2 ) = @_; my @matches = [ 0, -1 ]; my $longest = 0; my @last; my @this; for my $b1 ( 0 .. $#$a1 ) { @last = @this; @this = (); for my $b2 ( 0 .. $#$a2 ) { if ( $$a1[$b1] eq $$a2[$b2] ) { my $e2 = $b2+1; $last[$e2-1] \|\|= 0; $this[$e2] = $last[$e2-1] + 1; if ($this[$e2] > $longest) { $longest = $this[$e2]; @matches = (); } if ($this[$e2] == $longest) { push @matches, [ ($b1-$longest+1), $b1 ]; } } } } my ($beg, $end) = @{$matches[0]}; return [ @{$a1}[ $beg..$end ] ]; }` [download] Or since you're only returning one of the longest, `sub lcs_arrays { my ( $a1, $a2 ) = @_; my $beg = 0; my $end = -1; my $longest = 0; my @last; my @this; for my $b1 ( 0 .. $#$a1 ) { @last = @this; @this = (); for my $b2 ( 0 .. $#$a2 ) { if ( $$a1[$b1] eq $$a2[$b2] ) { my $e2 = $b2+1; $last[$e2-1] \|\|= 0; $this[$e2] = $last[$e2-1] + 1; if ($this[$e2] > $longest) { $longest = $this[$e2]; $beg = $b1-$longest+1; $end = $b1; } } } } return [ @{$a1}[ $beg..$end ] ]; }` [download] Update: Tested. Added fixes.	[reply] [d/l] [select]
Re^2: LCS efficiency problem by BrowserUk (Patriarch) on Jun 07, 2008 at 01:01 UTC
I tried it on the sample data posted above by BrowserUK, with the single answer LCS word string returned in very little time (consistently less than 0.02 sec Cool. Now think about doing all the 3,206,778 pairing of 2533 sentences. 3,206,778 * 0.02 = 64135.56 seconds = 17 hrs 49 minutes. That's why I was somewhat impressed with my "few seconds under 10 minutes". If you would like the 2533 sentences I extracted from Huckleberry Finn in order to perform a real comparison, /msg me an email id and I'll forward it to you. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply]
Re^2: LCS efficiency problem by ikegami (Patriarch) on Jun 07, 2008 at 06:10 UTC
Algorithm::Diff's function finds the Longest Common Subsequence (LCS). The OP wants the Longest Common Substring. (It's been called LCSS here, but google doesn't recognize that abbreviation.)	[reply]
Re: LCS efficiency problem by throop (Chaplain) on Jun 08, 2008 at 02:41 UTC
I've seen some very impressive Perl code in the answers so far. I agree that the best answer to the OP is use a better algorithm. I'm not up to coding a better answer. But I suspect that the answers so far could be speeded up further by using the 'jump tables' trick in the Boyer Moore String Search algorithm.	[reply]
Re^2: LCS efficiency problem by zhe (Initiate) on Jun 10, 2008 at 06:58 UTC
Hi, all, Thanks very much for all your replies.Learned a lot from you guys. I have two files,each of them is sentences.For example file1 has sentences: A C D E G F D C B F D E A A A C F D B A file2 has sentences: A C D F D B F D D the result would be for each of the sentence in file 1 matching each of the sentence in file 2, return the longest common substring and their index on a word level between the two sentences. RESULT: for sentence 1 in file 1 LCS :A C D0,3 F D5,6 F D5,6 Original Sentence:A C D E G F D sentence 2 in file 1 LCS :A5,5 C0,0 D3,3 F D3,4 F D 3,4 Original Sentence:C B F D E A sentence 3 in file 1 LCS :A C1,2 F D B3,5 F D3,4 Original Sentence:A A C F D B A So is there any way to make this process more efficiency?	[reply]

The best way of speeding this up, is to use a better algorithm.

You can save a considerable number of these searches

Two sanitised sentences drawn from Huckleberry Finn:

Naive solution work factor

Better algorithm work factor

Futher savings

Worked example

Here is a small sample (40 longest) of the LCWSs found: