in reply to The sum of absolute differences in the counts of chars in two strings.

It just seems to me, and please don’t take this glibly or wrongly, that the first order of business would be to try to guess ... what would actually slow it down?   I can’t think of anything...

If the set of characters is fixed, then a fixed-size array will accumulate counts with no timing difference between one character and another, and there is zero possibility of page-fault activity here.   Likewise, I can scarcely imagine how the time required to simply loop through both arrays in their entirety, each and every time, to calculate the differences could ever be objectionable ... or, for that matter, variable.
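
For concreteness, here is a minimal sketch of that dumb-and-direct approach, assuming byte-sized characters and a 256-slot array (the code and names are mine, not from the original post):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Tally byte counts for each string into fixed 256-slot arrays,
    # then sum the absolute differences of the two tallies.
    sub count_diff {
        my ( $s1, $s2 ) = @_;
        my @c1 = (0) x 256;
        my @c2 = (0) x 256;
        $c1[ ord $_ ]++ for split //, $s1;
        $c2[ ord $_ ]++ for split //, $s2;
        my $sum = 0;
        $sum += abs( $c1[$_] - $c2[$_] ) for 0 .. 255;
        return $sum;
    }

    print count_diff( 'abracadabra', 'alakazam' ), "\n";    # prints 11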

I mean, sure, you could think of “optimizations,” but in a fairly trivial case like this, they could wind up costing more time.   A completely dumb, completely brute-force solution is going to crunch through any set of strings presented to it with an unchanging (and extremely rapid) cycle-time per character no matter what the data turns out to be.

In short, my initial reaction is that the most obvious way is the best way.   I genuinely suspect that any deviation from this straight path might well wind up being significantly slower.

Re^2: The sum of absolute differences in the counts of chars in two strings.
by BrowserUk (Patriarch) on Nov 20, 2011 at 07:57 UTC
    I mean, sure, you could think of “optimizations,” but in a fairly trivial case like this, they could wind up costing more time.

    Time and again here, one or another of the monks has spotted a solution to a problem that at first sight appeared intractable to optimisation.

    For example: in Comparing a large set of DNA sequences, the OP describes an apparently simple process of cross-comparing 100,000 x 20-char strings that "takes on the order of days to process".

    Various approaches to this problem have been suggested down the years, including the use of any of several well-known fuzzy-matching and edit-distance algorithms: String::Approx, Text::Levenshtein, Text::Soundex, Text::WagnerFischer and others. But these brute-force, char-by-char O(M*N) comparisons are horribly expensive -- even when coded in XS -- and often not that useful for the desired results either, as edit distance is a quite different metric from the one required: 'are these two strings the same except in at most n places?'.
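
    A quick illustration of that last point, with example strings of my own choosing: a tiny edit distance does not imply few mismatched positions, so filtering on edit distance answers the wrong question.

        use strict;
        use warnings;

        # A one-character rotation of the same 10-char string:
        my $x = 'ACGTACGTAC';
        my $y = 'CGTACGTACA';

        # Positional mismatches -- the metric the problem actually needs:
        my $mismatches = grep { substr( $x, $_, 1 ) ne substr( $y, $_, 1 ) }
                         0 .. length($x) - 1;
        print "$mismatches\n";    # 10: every position differs

        # Yet the Levenshtein edit distance is only 2 (drop the leading
        # 'A', append one), so an edit-distance filter would rank these
        # two strings as near-identical.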

    Another frequently offered alternative is the use of regexes, perhaps programmatically generated. Whilst these work correctly and beat the edit-distance algorithms hands down, they generally require quite-to-very complex regexes that, of necessity, contain many backtrack points, with the result that their runtime performance is still understandably poor.
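
    To make the backtracking point concrete, here is one way such a regex can be generated (my own construction, not code from any of the linked threads): to allow at most one mismatch against a fixed target, wildcard each position in turn and join the variants into one alternation. For "at most n" mismatches the alternation grows combinatorially, and every branch is a backtrack point.

        use strict;
        use warnings;

        my $target = 'ACGTACGT';

        # Build one alternation branch per position, with that position
        # wildcarded; a string matches a branch iff it differs from
        # $target in at most that one place.
        my @branches;
        for my $i ( 0 .. length($target) - 1 ) {
            my $alt = $target;
            substr( $alt, $i, 1 ) = '.';
            push @branches, $alt;
        }
        my $fuzzy = qr/^(?:@{[ join '|', @branches ]})$/;

        print "match\n"    if 'ACGAACGT' =~ $fuzzy;    # one mismatch
        print "no match\n" if 'ACAAACGT' !~ $fuzzy;    # two mismatches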

    The root of the problem is the O(N²/2) nature of the round-robin comparison process. The quadratic growth in the number of comparisons required means that even for relatively low numbers of strings -- and in genomic work, 100,000 is low -- every extra microsecond taken performing each comparison adds an hour and a half to the processing time: 100,000²/2 = 5 billion comparisons, and 5 billion microseconds is roughly 5,000 seconds.

    And by the time you get to a still relatively small 1 million strings, each extra microsecond will cost you an extra 138 hours (1,000,000²/2 = 5×10¹¹ comparisons, or about 5×10⁵ seconds). Almost an extra week of waiting, not to mention the ~15 kWh of electricity used in the process.

    It would likely not be at all obvious to you that using string-wise XOR would form the basis of the most efficient (pure Perl) method of doing this, but its use here reduces days of processing to a few hours.
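
    For anyone who hasn't met the trick, the essence is this (my sketch, not the code from that thread): XORing two equal-length strings yields "\0" at every position where they agree, and tr/// counts the non-"\0" bytes, so the whole comparison runs at C speed with no per-character Perl loop.

        use strict;
        use warnings;

        my $s1 = 'ACGTACGTACGTACGTACGT';
        my $s2 = 'ACGTACCTACGTACGAACGT';

        # Stringwise XOR: "\0" where the strings agree, non-"\0" where
        # they differ; tr/\0//c counts the differing positions.
        my $xor        = $s1 ^ $s2;
        my $mismatches = $xor =~ tr/\0//c;
        print "$mismatches\n";    # 2

        # Keep the pair only if it differs in at most $n places:
        my $n = 2;
        print "similar\n" if $mismatches <= $n;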

    Even less obvious is the likelihood that anyone would offer a solution that reduces the number of comparisons required from 5 billion fuzzy matches to the order of 300,000 O(1) lookups, yet here it is.
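
    The linked thread has the details, but for the curious, one well-known way such a reduction can be achieved -- my generic sketch, not necessarily the solution offered there -- is the pigeonhole principle: two 20-char strings that differ in at most 2 positions must agree exactly on at least one of three fixed chunks, so indexing each string's three chunks in a hash (3 x 100,000 = 300,000 O(1) operations) finds every candidate pair, and only the handful of bucket-mates need a real comparison.

        use strict;
        use warnings;

        my @strings = (
            'ACGTACGTACGTACGTACGT',
            'ACGTACCTACGTACGAACGT',    # differs from the first in 2 places
            'TTTTTTTTTTTTTTTTTTTT',
        );

        # Index every string under each of its three chunks (7+7+6 chars).
        my %bucket;    # "chunkno:chunktext" => [ string ids ]
        for my $id ( 0 .. $#strings ) {
            my @chunks = unpack 'A7 A7 A6', $strings[$id];
            push @{ $bucket{"$_:$chunks[$_]"} }, $id for 0 .. 2;
        }

        # Strings differing in <= 2 places must share at least one bucket;
        # verify each candidate pair with a single XOR count.
        my %seen;
        for my $ids ( grep { @$_ > 1 } values %bucket ) {
            for my $i ( 0 .. $#$ids - 1 ) {
                for my $j ( $i + 1 .. $#$ids ) {
                    my ( $p, $q ) = @{$ids}[ $i, $j ];
                    next if $seen{"$p,$q"}++;
                    my $xor  = $strings[$p] ^ $strings[$q];
                    my $diff = $xor =~ tr/\0//c;
                    print "$p and $q differ in $diff place(s)\n"
                        if $diff <= 2;
                }
            }
        }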

    Whilst my question has yet to garner anything other than more efficiently implemented brute-force algorithms -- themselves very valuable in the absence of a different solution -- there are still at least a couple of avenues as yet unexplored.

    So, whilst I couldn’t think of any game-changing algorithm or approach to the problem myself, I’m long enough in the tooth to know that asking does no harm and can often lead to surprises.

    So, you'll forgive me -- or not -- if I file your "I can’t think of anything.. non-response, under the heading of: "Saw the opportunity to garner a few XP, but couldn't think of anything useful to say, so I'll say just that, and wrap it in some other plausible but meaningless drivel."


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Omitting ... please!! ... that final paragraph of your post, which was entirely unnecessary, I found your comments up to that point extremely interesting, and I immediately followed the link you gave in the second-from-last paragraph and read the entire thread.

      Please enlighten me here ... I am quite, quite serious ... what is it about your original description of the problem that I have overlooked?   If the “alphabet” were a normal one, say no more than 256 elements, then we would have a 256-element array set to all-zero, and we would tally each character with zero variance in time based on character value.   Then we would loop through exactly 256 accumulators, adding them up and setting them back to zero as we go.   And then we would have our answer.

      As do many of us, I have plenty of experience with (differing kinds of) “massive problems” in which very slight gains, repeated hundreds of millions of times if not many times more, make all the difference in the world ... which, having said that, “is neither here nor there.”   So, I fully appreciate what you are saying, right down to and including the “15 kWh of electricity,” but I simply do not yet see what makes this particular problem “hard.”

      I know that it must be, or you would not have said it.   (I would be an idiot to question your knowledge or experience.)   But something in the description, I guess, eluded me.   What did I overlook?   I really do want to know.

        what is it ... that I have overlooked?

        Well, apart from the obvious: whatever any one of the several hundred other monks might think of that neither you nor I did; you are asking the wrong question.

        Instead, try asking yourself: "What did I contribute to the thread?".

        And then you might ask yourself the same of Re: windows 7 HardwareID, amongst many others.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.