in reply to finding tuples

It's still entirely unclear to me how the tuples must be formed.

It seems that they can overlap, and don't even have to be substrings of the string you're searching on. What are the restrictions? when are two tuples "conceptual duplicates"? How long are your input strings, typically? Is your character repertoire always limited to A-Z?

Any advice how to go on?

First formulate your requirements in precise language, so that anybody how doesn't know about biology can help if you.

Replies are listed 'Best First'.
Re^2: finding tuples
by Anonymous Monk on Jun 23, 2009 at 19:52 UTC

    As I said above, it is true they are subsets, not substrings. For example, the single solution of AAAAEEEE is AAAA;EEEE, and EEEE;AAAA is identical to it, just written differently, so it should not count twice. Likewise the order within a subset: AAAA;GMST is identical to AAAA;TSGM and should count only once.

    Duplicates occured when I used permutation. I give a minimal example. A permutation of AADE is AAED. But order does not matter. Therefore the permutation algorithm wasted time.

    Typical set sizes range from 8 to 24 in steps of 4, but mostly longer than shorter.

    The repertoire is limited to ADEFGMSTV.

    I hope this helps your understanding.

      So let me summarize:
      • You have a set of input characters and their repetition counts
      • You look for a maximal set of ordered four-tuples that each consist of either four identical or four completely different letters
      • Only if a character is repeated at least four times in the input it can be used in tuple with four identical characters

      That's the only set of rules that I could extract from what you wrote so far. However that doesn't explain this: Sometimes it is not advisable to form a sameness-tuple because then I destroy a solution involving multiple sequences

      What is the exact rule of how one tuple can "destroy" a solution, if overlap is allowed anyway?

        You look for a maximal set of ordered four-tuples that each consist of either four identical or four completely different letters

        The OP said "same or sequential alphabetic neighbours" and later said the alphabet is ADEFGMSTV. He also clarified that order isn't important.

        My best guess is that the following are allowed

        • AAAA
        • ADEF
        • AAAD

        and the following aren't

        • AEFG (A doesn't neighbour E, F or G)

        Please confirm.

        What is the exact rule of how one tuple can "destroy" a solution, if overlap is allowed anyway?

        A given instance of a letter can be reused. He's talking about overlaps in the input string.

        +-+-+-+----- tuple 1 matches character between indexes 0 and 6 | | | | AADDEEFF overlap over indexes 1 to 6 | | | | +-+-+-+---- tuple 2 matches character between indexes 1 and 7

        I must correct rule 2: I am looking for the complete set of unordered 4-tuples, either four identical or four consecutive. Four consecutive is any true 4-substring of ADEFGMSTV.

        Sometimes it is not advisable to form a sameness-tuple

        I will explain on a minimal example. Split up the set AADDDEEEEFFFFGGM. It is tempting to first take out the 4 same E and F, but then the remainder is AADDDGGMM from which a solution cannot be formed any more because they are not alphabetic neighbours. It is false to give up now and assume that aforementioned set has no solution as there *is* a (single) solution: ADEF;ADEF;DEFG;EFGM. So one has to be smart what to pick first to not end up in dead ends.

        Rarely there are multiple solutions, but they do occur. Trivial example is GGGGMMMMSSSSTTTT. Its two solutions are GGGG;MMMM;SSSS;TTTT and GMST;GMST;GMST;GMST.