comment on

How long is it taking now? And what is your target time?

The best way of speeding this up, is to use a better algorithm.

Naively, discovering the longest common word sequence between two sentences "a b c d" & "p q r" (with letters representing arbitrary words and assuming that a minimum LCWS of 2 words is required), means that you test if "p q r" is found in "a b c d", if not, then "p q", then "q r".

If you search the longer string for word subsequences of the shorter, if the shorter string is 3 words, you need to do 3 searches. 4 words, 6 searches. 5 words, 10. Ie. nCr where n is the number of words in the shorter sentence and r=2. So by the time your shorter sentence has 10 words you need 45 string searches, and by 20, 190 searches.

But many of these subsequences of the shorter string simply cannot exist in the longer string, because they contain individual words that do not exist there. And vice versa. You are comparing against words in the longer string that do not exist in the shorter.

So, given sentences (again letters represent words), "A B C D E F G H I" & "P Q C D E R S A B", then you would be comparing each of these 36 subsequences from the first string against the second:

[A B C D E F G H I][A B C D E F G H][A B C D E F G]
[A B C D E F][A B C D E][A B C D][A B C][A B]

[B C D E F G H I][B C D E F G H][B C D E F G]
[B C D E F][B C D E][B C D][B C]

[C D E F G H I][C D E F G H][C D E F G]
[C D E F][C D E][C D]

[D E F G H I][D E F G H][D E F G][D E F][D E]

[E F G H I][E F G H][E F G][E F]

[F G H I][F G H][F G]

[G H I][G H]

[H I]
[download]

You can save a considerable number of these searches

by breaking each of the string into contiguous fragments that contain only words that exist in the other string, and then performing the LCWS on those fragments. Eg.

Fragmenting "A B C D E F G H I" to substring containing only words present in "P Q C D E R S A B",
gives "A B", "C D E".
Fragmenting "P Q C D E R S A B" to substring containing only words present in "A B C D E F G H I",
gives "C D E" & "A B".

The work required to find the LCWS "C D E" is considerably less. Realising that such pat examples are not too convincing, here's a more realistic example.

Two sanitised sentences drawn from Huckleberry Finn:

"when the place couldn't hold no more the duke he quit tending door and went around the back way and come on to the stage and stood up before the curtain and made a little speech and praised up this tragedy and said it was the most thrillingest one that ever was and so he went on abragging about the tragedy and about edmund kean the elder which was to play the main principal part in it and at last when hed got everybodys expectations up high enough he rolled up the curtain and the next minute the king come aprancing out on all fours naked and he was painted all over ringstreakedand striped all sorts of colors as splendid as a rainbow"
123 words.
"well when the place couldnt hold no more people the duke he give a fellow a quarter and told him to tend door for him a minute and then he started around for the stage door i after him but the minute we turned the corner and was in the dark he says"
53 words.

The shorter of these two sentences contains 53 words. (Don'tcha just love Mark Twain's run-on sentences :). So, the naive approach would require searching the longer sentences 123 words, for all 1378 subsequences of the shorter.

Naive solution work factor

As a rough measure of the effort/time involved in the above, if we multiply the number of words being searched 123, by the number of words in each of the 1378 subsequences, and sum the results, we get a 'work factor' of: 123*53 + 123*52*2 + 123*51*3 + 123*50*4 ... + 123*3*51 + 123*2*52 = 26182 * 123 = 3,220,386.

But, if you fragment those two sentences as above, you get the following two sets of fragments:

Sentence A fragments:

[ when the place couldnt hold no more the duke he ]
[ door and ]
[ around the ]
[ to the stage and ]
[ was the ]
[ was and ]
[ was to ]
[ and the ]
[ minute the ]
[ and he was ]
[download]

Sentence B fragments:
[ when the place couldnt hold no more ]
[ the duke he ]
[ a minute and ]
[ the stage door ]
[ the minute ]
[ and was in the ]
[download]

Better algorithm work factor

If you now compare each of the A fragments against each of the B fragments, and perform similar polynomial calculations for each pairing:

10: [ 6 3 3 3 2 4 ] = 890
2: [ 6 3 3 3 2 4 ]  = 246
2: [ 6 3 3 3 2 4 ]  = 246
4: [ 6 3 3 3 2 4 ]  = 330
2: [ 6 3 3 3 2 4 ]  = 246
2: [ 6 3 3 3 2 4 ]  = 246
2: [ 6 3 3 3 2 4 ]  = 246
2: [ 6 3 3 3 2 4 ]  = 246
2: [ 6 3 3 3 2 4 ]  = 246
3: [ 6 3 3 3 2 4 ]  = 276
                    =3218
[download]

Giving a total work factor of 3218. Or approx. 0.01% of the work. Not totally accurate obviously as it doesn't take into account the effort of fragmenting the strings, but a considerable potential savings anyway.

Futher savings

And it doesn't stop there. If you sort the fragments in each set, longest first, then you are most likely to discover the LCWS in the early sets and can skip testing when any fragment is shorter than the best you found so far.

Worked example

The upshot of this is that I ran my algorithm against the 2533 sentences extracted from the Project Gutenberg version of "The Adventures of Huckleberry Finn by Mark Twain" and it found the 189,983 2-word or more LCWSs that result from the ~~5.4 billion~~ 3,206,778 pairings of those sentences in just a few seconds under 10 minutes.

Note:The math above is unchecked and may contain errors. The timings and results are correct.

Here is a small sample (40 longest) of the LCWSs found:

sentence_a_no / sentence_b_no [ longest common word sequence between t
+hem ]

( 1354 / 1364 )=>[ when the place couldnt hold no more ]
(  992 /  993 )=>[ and underneath the picture it said ]
( 2037 / 2343 )=>[ it dont make no difference whether ]
( 1237 / 1242 )=>[ with the tears running down their ]
(  534 / 1115 )=>[ i made up my mind i wouldnt ever ]
(  587 /  591 )=>[ looking at me pretty curious and ]
(  347 / 1206 )=>[ towards the middle of the river ]
(  347 / 1365 )=>[ towards the middle of the river ]
( 1206 / 1365 )=>[ towards the middle of the river ]
( 1184 / 1185 )=>[ you aint the only person thats ]
(   19 / 2140 )=>[ i couldnt see no advantage in ]
(  107 / 2065 )=>[ i couldnt make out how he was ]
(  654 /  689 )=>[ to keep from getting run over ]
(  786 /  787 )=>[ aint it natural and right for ]
( 1271 / 1351 )=>[ continental theatres in their ]
( 1586 / 1626 )=>[ the king and the duke come up ]
( 2185 / 2445 )=>[ we went down the lightningrod ]
(    3 /  612 )=>[ i couldnt stand it no longer ]
(  352 / 1155 )=>[ got further and further away ]
(  407 /  521 )=>[ paddled over to the illinois ]
(  829 / 1131 )=>[ it was a monstrous big river ]
( 1468 / 1623 )=>[ around each others necks and ]
( 2099 / 2523 )=>[ with a torchlight procession ]
(  104 /  408 )=>[ i went out in the woods and ]
(  347 /  935 )=>[ a quarter of a mile or more ]
(  393 / 1980 )=>[ wanted to know all about it ]
(  998 / 1237 )=>[ with the tears running down ]
(  998 / 1242 )=>[ with the tears running down ]
(  998 / 1470 )=>[ with the tears running down ]
( 1050 / 1134 )=>[ on tother side of the river ]
( 1237 / 1470 )=>[ with the tears running down ]
( 1242 / 1470 )=>[ with the tears running down ]
(   19 /  104 )=>[ i couldnt see no advantage ]
(   42 / 1613 )=>[ and stretched his neck out ]
(  104 / 2140 )=>[ i couldnt see no advantage ]
(  213 /  521 )=>[ over to the illinois shore ]
(  345 /  739 )=>[ up shore in the easy water ]
(  361 / 1311 )=>[ places on the ground where ]
(  812 / 1136 )=>[ couldnt tell nothing about ]
( 1013 / 1804 )=>[ there in the middle of the ]
( 1652 / 2119 )=>[ aint had no experience and ]
[download]

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

In reply to Re: LCS efficiency problem by BrowserUk
in thread LCS efficiency problem by zhe

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.