in reply to Re: Does String::LCSS work?
in thread Does String::LCSS work?

Trust me, that's not fast. Using that script on 100 strings takes 76 seconds and 90MB:

c:\test>junk90 junk90.dat
Loaded. Generating combos...
000001 and 000002: 127 chars ... starting at 37 and 872, respectively.
000008 and 000089: 10 chars ... starting at 550 and 355, respectively.
000040 and 000081: 11 chars ... starting at 219 and 623, respectively.
000046 and 000056: 12 chars ... starting at 808 and 845, respectively.
000058 and 000069: 11 chars ... starting at 837 and 276, respectively.
Best overall match: 127 chars 000002:872 and 000001:37
Completed in 76.985

Using String::LCSS_XS it takes 15 seconds and 5MB:

c:\test>LCSS10 junk90.dat
000001(37) and 000002(872): 127 '5808821137152553645216516684787076304368738347768274782252043367265484547586755564151615422250715355234473558428710868782135070'
000008(550) and 000089(355): 10 '3252367176'
000040(219) and 000081(623): 11 '61341721171'
000046(808) and 000056(845): 12 '876526361506'
000058(837) and 000069(276): 11 '00666788082'
Took: 14.594 seconds
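
For readers who want to see what both runs are computing, here is a minimal sketch of the all-pairs longest-common-substring search. It is written in Python purely for illustration and uses the textbook dynamic-programming approach; it is not the code of either script above, nor of String::LCSS_XS, and the names `longest_common_substring` and `best_pair` are made up for this example.

```python
from itertools import combinations


def longest_common_substring(a, b):
    """Return (length, start_in_a, start_in_b) of the longest common substring."""
    best_len, best_a, best_b = 0, 0, 0
    # prev[j] holds the length of the common suffix of a[:i-1] and b[:j].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous character of both strings.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_a = i - curr[j]
                    best_b = j - curr[j]
        prev = curr
    return best_len, best_a, best_b


def best_pair(strings):
    """Compare every pair; return (length, i, j, start_i, start_j) of the best match."""
    best = (0, None, None, 0, 0)
    for (i, s), (j, t) in combinations(enumerate(strings), 2):
        n, ps, pt = longest_common_substring(s, t)
        if n > best[0]:
            best = (n, i, j, ps, pt)
    return best
```

With 100 strings there are 4950 pairs to compare, and each comparison of two 1000-char strings fills a 1000 x 1000 table, so this sketch shows the shape of the problem rather than a fast way to solve it.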

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"I'd rather go naked than blow up my ass"

Re^3: Does String::LCSS work?
by repellent (Priest) on Jan 27, 2010 at 20:00 UTC
    OK, I'll take your word that the performance isn't good enough for practical use. The speed does look to be as advertised, though: 100 strings of 1000 chars take about 4 minutes, which is on the order of minutes.

    Just glancing at the code, it seems that it doesn't suffer from the limitations of String::LCSS_XS pointed out by ikegami?

      I guess the hardware has moved on a bit in the last 5 years. I also used a file of 100 1000-char (byte) strings.

      For my purposes, Unicode isn't a concern. Maybe 10 years from now, once we stop penny-pinching over memory with variable-length character encodings and start using straight 32-bit characters universally, it'll be possible to write efficient text-munging code again. Till then, I'll stick with ASCII/iso-whatever unless I'm forced to deal with Unicode.

      IMO. The guys that came up with the variable-length encoding should be tried for treason to humanity :)


        IMO. The guys that came up with the variable-length encoding should be tried for treason to humanity :)

        I agree with the proposed action, but I don't think the problem is the variable length so much as the incredible amount of "appears to work" going on as a result of pretending to be backwards compatible. Things would work much better if they failed more noticeably when done wrong.

        For example, UTF-16 is a variable width encoding, yet when someone fails to handle it, they don't move on until it's fixed. ("How do I get rid of the space...")
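
To make the variable-width point concrete, here is a small sketch that counts how many bytes single characters need in UTF-8, UTF-16 and UTF-32. Python is used only because it is convenient for a quick check, and `encoded_widths` is a hypothetical helper name, not something from the thread.

```python
def encoded_widths(text):
    """Map each character to its byte count in (UTF-8, UTF-16, UTF-32)."""
    return {
        ch: (
            len(ch.encode("utf-8")),
            len(ch.encode("utf-16-le")),  # "-le" variants avoid counting a BOM
            len(ch.encode("utf-32-le")),
        )
        for ch in text
    }


# U+0041 (ASCII), U+00E9, U+20AC, and U+1F600 (outside the Basic Multilingual Plane)
widths = encoded_widths("A\u00e9\u20ac\U0001F600")
for ch, (u8, u16, u32) in widths.items():
    print(f"U+{ord(ch):04X}: utf-8={u8} utf-16={u16} utf-32={u32}")
```

ASCII characters take a single byte in UTF-8, so code that assumes one byte per character keeps appearing to work until non-ASCII data turns up. In UTF-16 even ASCII takes two bytes, so that assumption breaks on the first character; only characters outside the Basic Multilingual Plane need a second 16-bit unit. UTF-32 is the only one of the three where every character has the same width.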