in reply to Re: finding tuples
in thread finding tuples

As I said above, it is true they are subsets, not substrings. For example, the single solution of AAAAEEEE is AAAA;EEEE, and EEEE;AAAA is identical to it, just written differently, so it should not count twice. Likewise the order within a subset: AAAA;GMST is identical to AAAA;TSGM and should count only once.

Duplicates occured when I used permutation. I give a minimal example. A permutation of AADE is AAED. But order does not matter. Therefore the permutation algorithm wasted time.

Typical set sizes range from 8 to 24 in steps of 4, but mostly longer than shorter.

The repertoire is limited to ADEFGMSTV.

I hope this helps your understanding.

Replies are listed 'Best First'.
Re^3: finding tuples
by moritz (Cardinal) on Jun 23, 2009 at 20:15 UTC
    So let me summarize:
    • You have a set of input characters and their repetition counts
    • You look for a maximal set of ordered four-tuples that each consist of either four identical or four completely different letters
    • Only if a character is repeated at least four times in the input it can be used in tuple with four identical characters

    That's the only set of rules that I could extract from what you wrote so far. However that doesn't explain this: Sometimes it is not advisable to form a sameness-tuple because then I destroy a solution involving multiple sequences

    What is the exact rule of how one tuple can "destroy" a solution, if overlap is allowed anyway?

      You look for a maximal set of ordered four-tuples that each consist of either four identical or four completely different letters

      The OP said "same or sequential alphabetic neighbours" and later said the alphabet is ADEFGMSTV. He also clarified that order isn't important.

      My best guess is that the following are allowed

      • AAAA
      • ADEF
      • AAAD

      and the following aren't

      • AEFG (A doesn't neighbour E, F or G)

      Please confirm.

      What is the exact rule of how one tuple can "destroy" a solution, if overlap is allowed anyway?

      A given instance of a letter can be reused. He's talking about overlaps in the input string.

      +-+-+-+----- tuple 1 matches character between indexes 0 and 6 | | | | AADDEEFF overlap over indexes 1 to 6 | | | | +-+-+-+---- tuple 2 matches character between indexes 1 and 7

        Almost right. AAAD is not allowed. It is neither 4 same nor 4 consecutive.

        You explained overlap correctly - it is an artifact from making a string out of a set.

      I must correct rule 2: I am looking for the complete set of unordered 4-tuples, either four identical or four consecutive. Four consecutive is any true 4-substring of ADEFGMSTV.

      Sometimes it is not advisable to form a sameness-tuple

      I will explain on a minimal example. Split up the set AADDDEEEEFFFFGGM. It is tempting to first take out the 4 same E and F, but then the remainder is AADDDGGMM from which a solution cannot be formed any more because they are not alphabetic neighbours. It is false to give up now and assume that aforementioned set has no solution as there *is* a (single) solution: ADEF;ADEF;DEFG;EFGM. So one has to be smart what to pick first to not end up in dead ends.

      Rarely there are multiple solutions, but they do occur. Trivial example is GGGGMMMMSSSSTTTT. Its two solutions are GGGG;MMMM;SSSS;TTTT and GMST;GMST;GMST;GMST.