
The fact that the OP looks for patterns in order of increasing length suggests that the pattern should not include a repeat.

Caution: Contents may have been coded under pressure.

Re^3: Regexps for microsatellites
by ikegami (Patriarch) on Nov 08, 2004 at 16:23 UTC

    If that's so, then your solution doesn't work. It can easily be fixed by substituting (.{1,6}?) for the existing (.{1,6}).

    Update: Nope, adding '?' is no good, cause it'll think AAAGTCAAAGTC is Ax3 instead of AAAGTCx2.
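To make the greedy-versus-lazy difference concrete, here is a minimal sketch. The parent post's exact regex is not shown in this thread, so the pattern shape `((.{1,6})\2+)` is an assumption, and `first_repeat` is a hypothetical helper introduced only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: report the first tandem repeat as "UNITxCOUNT".
sub first_repeat {
    my ($re, $str) = @_;
    return 'none' unless $str =~ $re;
    # $1 is the whole run, $2 is the repeated unit.
    return sprintf '%sx%d', $2, length($1) / length($2);
}

my $greedy = qr/((.{1,6})\2+)/;     # assumed shape of the existing pattern
my $lazy   = qr/((.{1,6}?)\2+)/;    # same pattern with '?' added

for my $seq (qw(CATCATCATCATCAT AAAGTCAAAGTC)) {
    printf "%-16s greedy=%-9s lazy=%s\n",
        $seq, first_repeat($greedy, $seq), first_repeat($lazy, $seq);
}
```

Under these assumptions the greedy form reports CATCATx2 and AAAGTCx2, while the lazy form reports CATx5 and Ax3, so neither variant gives the "obvious" answer for both inputs.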

      AAAGTC is Ax3, assuming a match of three or more counts. Shorter matches get preference.

      Caution: Contents may have been coded under pressure.
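A small sketch of the "three or more counts" reading, assuming the lazy pattern from above with a minimum-count parameter. `shortest_repeat` and its `$min` argument are hypothetical names, not something from the original posts.

```perl
use strict;
use warnings;

# Hypothetical helper: first repeat with at least $min copies of the unit,
# preferring the shortest unit that reaches that count.
sub shortest_repeat {
    my ($str, $min) = @_;
    my $extra = $min - 1;    # copies required after the first one
    return 'none' unless $str =~ /((.{1,6}?)\2{$extra,})/;
    return sprintf '%sx%d', $2, length($1) / length($2);
}

print shortest_repeat('AAAGTC', 3), "\n";
print shortest_repeat('AAAGTC', 4), "\n";
print shortest_repeat('AAGTCAAGTCAAGTC', 3), "\n";
```

With a minimum of three, AAAGTC is reported as Ax3. In AAGTCAAGTCAAGTC the single-letter unit only reaches two copies, so the shortest unit that qualifies is AAGTC.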

        So two questions for knirirr:

        1) Should CATCATCATCATCAT give (a) CATx5 or (b) CATCATx2?

        2) Should AAAGTCAAAGTCAAAGTC give (a) Ax3 or (b) AAAGTCx3?

        This can be rephrased as:
        Should we favour longest match (1a and 2b),
        should we favour longest sequence (1b and 2b), or
        should we favour shortest sequence (1a and 2a)?
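The three policies can be compared side by side with a rough sketch like the one below. It is a simplification that only looks at repeats starting at offset 0 and requires at least two copies. `candidates`, `pick` and `%policy` are hypothetical names made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: every [unit, count] tandem repeat starting at
# offset 0, for unit lengths 1..6 and at least $min copies.
sub candidates {
    my ($str, $min) = @_;
    my @found;
    for my $len (1 .. 6) {
        last if $len > length $str;
        my $unit  = substr $str, 0, $len;
        my ($run) = $str =~ /^((?:\Q$unit\E)+)/;   # always matches the prefix
        my $count = length($run) / $len;
        push @found, [$unit, $count] if $count >= $min;
    }
    return @found;
}

# Each policy scores a candidate; the highest score wins.
my %policy = (
    'longest match'     => sub { length($_[0][0]) * $_[0][1] },  # total length
    'longest sequence'  => sub { length($_[0][0]) },             # unit length
    'shortest sequence' => sub { 0 - length($_[0][0]) },
);

sub pick {
    my ($score, @cands) = @_;
    my ($best) = sort { $score->($b) <=> $score->($a) } @cands;
    return sprintf '%sx%d', @$best;
}

for my $seq (qw(CATCATCATCATCAT AAAGTCAAAGTCAAAGTC)) {
    my @cands = candidates($seq, 2);
    for my $name (sort keys %policy) {
        printf "%-18s %-17s %s\n", $seq, $name, pick($policy{$name}, @cands);
    }
}
```

For the two test strings this picks CATx5 and AAAGTCx3 under longest match, CATCATx2 and AAAGTCx3 under longest sequence, and CATx5 and Ax3 under shortest sequence, which lines up with the (a)/(b) answers listed above.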

        Until then, we have:

        # Favour shortest sequence.
        my $thresh = 1;  # Match *more than* $thresh times.

        while (<DATA>) {
            chomp;
            while (/((.{1,6}?)\2{$thresh,})/g) {
                printf(
                    "Found %d %-6s (length=%d, total=%2d) at pos %2d in %s\n",
                    length($1) / length($2),  # Number of matches.
                    $2,                       # Sequence.
                    length($2),               # Length of sequence.
                    length($1),               # Length of match.
                    $-[0],                    # Start position.
                    $_                        # String we're searching.
                );
            }
        }

        __DATA__
        CATCATCATCATCAT
        AAAGTCAAAGTCAAAGTC

        gives:

        Found 5 CAT    (length=3, total=15) at pos  0 in CATCATCATCATCAT
        Found 3 A      (length=1, total= 3) at pos  0 in AAAGTCAAAGTCAAAGTC
        Found 2 GTCAAA (length=6, total=12) at pos  3 in AAAGTCAAAGTCAAAGTC