in reply to Re^2: Regexps for microsatellites
in thread Regexps for microsatellites

If that's so, then your solution doesn't work. It can easily be fixed by substituting (.{1,6}?) for the existing (.{1,6}).

Update: Nope, adding '?' is no good, because it'll think AAAGTCAAAGTC is Ax3 instead of AAAGTCx2.
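To make the trade-off concrete, here is a minimal sketch comparing the two quantifiers on the strings discussed in this thread. The helper name first_repeat is made up for illustration, and the pattern is assumed to have the general shape ((MOTIF)\2+), which may differ from the regex in the parent node.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return "MOTIFxN" for the first
# tandem repeat matched with the given motif sub-pattern, or undef if none.
sub first_repeat {
    my ($seq, $motif_re) = @_;
    # $1 is the whole run, $2 is a single copy of the motif.
    return unless $seq =~ /(($motif_re)\2+)/;
    return sprintf '%sx%d', $2, length($1) / length($2);
}

for my $seq (qw(CATCATCATCATCAT AAAGTCAAAGTC)) {
    print "$seq\n";
    print '  greedy     (.{1,6}):  ', first_repeat($seq, '.{1,6}'),  "\n";
    print '  non-greedy (.{1,6}?): ', first_repeat($seq, '.{1,6}?'), "\n";
}
```

The greedy form reports CATCATx2 and AAAGTCx2, while the non-greedy form reports CATx5 and Ax3, so each variant gets one of the two inputs "wrong" depending on which answer is wanted.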

Re^4: Regexps for microsatellites
by Roy Johnson (Monsignor) on Nov 08, 2004 at 18:01 UTC
    AAAGTC is Ax3, assuming a match of three or more counts. Shorter matches get preference.

    Caution: Contents may have been coded under pressure.

      So two questions for knirirr:

      1) Should CATCATCATCATCAT give (a) CATx5 or (b) CATCATx2?

      2) Should AAAGTCAAAGTCAAAGTC give (a) Ax3 or (b) AAAGTCx3?

      This can be rephrased as:
      Should we favour longest match (1a and 2b),
      should we favour longest sequence (1b and 2b), or
      should we favour shortest sequence (1a and 2a)?

      Until then, we have:

        # Favour shortest sequence.
        my $thresh = 1;  # Match *more than* $thresh times.

        while (<DATA>) {
            chomp;
            while (/((.{1,6}?)\2{$thresh,})/g) {
                printf(
                    "Found %d %-6s (length=%d, total=%2d) at pos %2d in %s\n",
                    length($1) / length($2),  # Number of matches.
                    $2,                       # Sequence.
                    length($2),               # Length of sequence.
                    length($1),               # Length of match.
                    $-[0],                    # Start position.
                    $_                        # String we're searching.
                );
            }
        }

        __DATA__
        CATCATCATCATCAT
        AAAGTCAAAGTCAAAGTC

      gives:

        Found 5 CAT    (length=3, total=15) at pos  0 in CATCATCATCATCAT
        Found 3 A      (length=1, total= 3) at pos  0 in AAAGTCAAAGTCAAAGTC
        Found 2 GTCAAA (length=6, total=12) at pos  3 in AAAGTCAAAGTCAAAGTC

        The answers are 1a and 2b:
        1) (CAT)5, as (CATCAT) can be broken down into two smaller motifs.
        2) (AAAGTC)3, as the motifs must be continuous.
        Of course, if the thresholds were set to detect mono repeats of such a small size then we would indeed find (AAA)3 and (GTC)3, but I usually don't run thresholds that low as such things are not biologically significant. Generally, if it's over about 10 repeats of the pattern it might be interesting, or about 20 repeats if it's a single character. Sorry for the vagueness on this matter.
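One way to get 1a and 2b at the same time is to stop relying on regex greediness and compare the candidate motifs explicitly. The following is only a sketch under stated assumptions: the sub name find_microsatellites and its option names are invented here, the defaults of 10 and 20 repeats are just the rough figures mentioned above, and the tie-break (longest total run first, then shortest motif) is one reading of the stated preferences. It needs Perl 5.10 or later for the // operator.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the thread. At each position it tries every
# motif length, keeps only candidates that reach their repeat threshold, and
# picks the longest total run; on a tie the shorter motif wins.
# Option names (max_motif, min_repeats, min_mono_repeats) are assumptions.
sub find_microsatellites {
    my ($seq, %opt) = @_;
    my $max_motif = $opt{max_motif}        // 6;
    my $min_rep   = $opt{min_repeats}      // 10;  # rough figure from the post
    my $min_mono  = $opt{min_mono_repeats} // 20;  # rough figure from the post

    my @hits;
    my $pos = 0;
    while ($pos < length $seq) {
        my $best;
        for my $len (1 .. $max_motif) {
            # Need room for at least two copies of the motif.
            last if $pos + 2 * $len > length $seq;
            my $motif = substr $seq, $pos, $len;

            # Count back-to-back copies of the motif starting at $pos.
            my $count = 1;
            $count++
                while substr($seq, $pos + $count * $len, $len) eq $motif;

            # Single-character motifs use their own (higher) threshold.
            next if $count < ($len == 1 ? $min_mono : $min_rep);

            # Replace only on a strictly longer run, so ties keep the
            # shorter motif (CAT rather than CATCAT).
            my $total = $count * $len;
            if (!$best || $total > $best->{total}) {
                $best = { motif => $motif, count => $count,
                          pos   => $pos,   total => $total };
            }
        }
        if ($best) {
            push @hits, $best;
            $pos += $best->{total};   # continue after the reported run
        }
        else {
            $pos++;
        }
    }
    return @hits;
}

# Thresholds lowered purely so the short test strings produce hits.
for my $seq (qw(CATCATCATCATCAT AAAGTCAAAGTCAAAGTC)) {
    for my $hit (find_microsatellites($seq, min_repeats => 2, min_mono_repeats => 4)) {
        printf "(%s)%d at pos %d in %s\n",
            $hit->{motif}, $hit->{count}, $hit->{pos}, $seq;
    }
}
```

With those lowered thresholds this reports (CAT)5 and (AAAGTC)3, both at position 0. With the default thresholds neither test string is long enough to be reported at all, which matches the point that such short runs are not of interest.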