comment on

Good question! In the benchmark cited, I used 100 1000-char string (just because that was what the benchmark of another algorithm posted here used).

In the following examples, LCSS10.pl is the script that uses String::LCSS_XS. The -MIN= parameter to the script simply stops it printing out any results less than MIN length. It doesn't stop it looking for and finding them.

LCSSN.pl uses my pp implementation. Here, the -MIN= parameter doesn't stop it locating smaller matches. But it does stop it from considering them when looking for the best match, which helps performance a lot.

Increasing the length doesn't affect the performance differential much. If anything it seems to favour my implementation. The following is run on 4, 100,000 char strings:

C:\test>perl -s LCSS10.pl -MIN=10 -- < 4x1e5.dat
000001(3463) and 000002(91858): 10 '8890235173'
000001(18712) and 000004(79151): 12 '260703543044'
000002(39758) and 000003(4141): 10 '1595533057'
000002(61266) and 000004(29466): 10 '6247963240'
000003(45661) and 000004(32254): 12 '381074855852'
Took: 170.777 seconds

C:\test>perl -s LCSSN.pl -MIN=10 -- < 4x1e5.dat
000001(3463) and 000002(91858): 10 '8890235173'
000001(18712) and 000004(79151): 12 '260703543044'
000002(39758) and 000003(4141): 10 '1595533057'
000002(61266) and 000004(29466): 10 '6247963240'
000003(45661) and 000004(32254): 12 '381074855852'
Took: 110.078 seconds
[download]

However, reducing the -MIN=2 does affect it, and shows String::LCSS_XS in a good light. The following are run on the same file:

C:\test>perl -s LCSS10.pl -MIN=2 -- < 4x1e5.dat
000001(3463) and 000002(91858): 10 '8890235173'
000001(34923) and 000003(7672): 9 '826854356'
000001(18712) and 000004(79151): 12 '260703543044'
000002(39758) and 000003(4141): 10 '1595533057'
000002(61266) and 000004(29466): 10 '6247963240'
000003(45661) and 000004(32254): 12 '381074855852'
Took: 171.108 seconds

C:\test>perl -s LCSSN.pl -MIN=2 -- < 4x1e5.dat
000001(3463) and 000002(91858): 10 '8890235173'
000001(91904) and 000003(82429): 9 '784839043'
000001(18712) and 000004(79151): 12 '260703543044'
000002(39758) and 000003(4141): 10 '1595533057'
000002(61266) and 000004(29466): 10 '6247963240'
000003(45661) and 000004(32254): 12 '381074855852'
Took: 1240.604 seconds
[download]

All the extra locating and overwriting the best yet does affect the algorithm badly by comparison. Presumably, if you added a MIN parameter to your code, it would benefit from it when used and would then beat mine blow for blow.

That said. Mine works best when there is large length differential between the two strings being compared. It currently makes 2 passes of the longer string for each character in the shorter string. As both are the same length in these benchmarks these tests are worst case.

I have a notion that, if coded in C, I can reduce that to 1 pass which would redress the balance a little, but probably not enough to make it worth while for the cases where both strings are approximately the same length.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"I'd rather go naked than blow up my ass"

In reply to Re^4: A better implementation of LCSS? by BrowserUk
in thread A better implementation of LCSS? by BrowserUk

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.