in reply to Re^3: Regexps for microsatellites
in thread Regexps for microsatellites

AAAGTC is Ax3, assuming a match of three or more counts. Shorter matches get preference.


Re^5: Regexps for microsatellites
by ikegami (Patriarch) on Nov 08, 2004 at 18:48 UTC

    So two questions for knirirr:

    1) Should CATCATCATCATCAT give (a) CATx5 or (b) CATCATx2?

    2) Should AAAGTCAAAGTCAAAGTC give (a) AAAx3 or (b) AAAGTCx3?

    This can be rephrased as:
    Should we favour the longest match (1a and 2b),
    should we favour the longest sequence (1b and 2b), or
    should we favour the shortest sequence (1a and 2a)?

    Until then, we have:

    # Favour shortest sequence.
    my $thresh = 1;  # Match *more than* $thresh times.

    while (<DATA>) {
        chomp;
        while (/((.{1,6}?)\2{$thresh,})/g) {
            printf(
                "Found %d %-6s (length=%d, total=%2d) at pos %2d in %s\n",
                length($1) / length($2),  # Number of matches.
                $2,                       # Sequence.
                length($2),               # Length of sequence.
                length($1),               # Length of match.
                $-[0],                    # Start position.
                $_                        # String we're searching.
            );
        }
    }

    __DATA__
    CATCATCATCATCAT
    AAAGTCAAAGTCAAAGTC

    gives:

    Found 5 CAT    (length=3, total=15) at pos  0 in CATCATCATCATCAT
    Found 3 A      (length=1, total= 3) at pos  0 in AAAGTCAAAGTCAAAGTC
    Found 2 GTCAAA (length=6, total=12) at pos  3 in AAAGTCAAAGTCAAAGTC
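For comparison, the "favour longest sequence" option (1b and 2b) appears to need only a greedy quantifier on the repeat unit instead of a lazy one. The following is a minimal sketch, not code from the thread: `scan` is a hypothetical helper name, and the fixed `\2+` stands in for a threshold of 1.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [unit, count, start] triples for every
# run of two or more back-to-back copies of a unit matching $unit_re.
sub scan {
    my ($seq, $unit_re) = @_;
    my @hits;
    while ($seq =~ /(($unit_re)\2+)/g) {
        push @hits, [ $2, length($1) / length($2), $-[0] ];
    }
    return @hits;
}

my %strategy = (
    'shortest sequence' => qr/.{1,6}?/,  # lazy: tries 1 char first
    'longest sequence'  => qr/.{1,6}/,   # greedy: tries 6 chars first
);

for my $name (sort keys %strategy) {
    for my $seq (qw(CATCATCATCATCAT AAAGTCAAAGTCAAAGTC)) {
        printf "%-17s %-6s x%d at pos %2d in %s\n", $name, @$_, $seq
            for scan($seq, $strategy{$name});
    }
}
```

With the greedy unit, CATCATCATCATCAT is reported as CATCAT x2 and AAAGTCAAAGTCAAAGTC as AAAGTC x3; with the lazy unit the results are the same as in the listing above.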
      The answers are 1a and 2b:
      1) (CAT)5, as (CATCAT) can be broken down into two smaller motifs.
      2) (AAAGTC)3, as the motifs must be continuous.
      Of course, if the thresholds were set to detect mono repeats of such a small size then we would indeed find (AAA)3 and (GTC)3, but I usually don't run thresholds that low as such things are not biologically significant. Generally, if it's over about 10 repeats of the pattern it might be interesting, or about 20 repeats if it's a single character. Sorry for the vagueness on this matter.
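One way to get both 1a and 2b might be to try every unit length at each position, keep the longest overall match, and let ties go to the shorter unit. This is a rough sketch under those assumptions rather than anything posted in the thread: `find_repeats` and its parameters are made-up names, and it scans left to right without looking for a better match starting further along.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [unit, count, start] triples.
# $thresh has the same meaning as above (match *more than* $thresh times);
# $max_unit is the longest repeat unit to try.
sub find_repeats {
    my ($seq, $thresh, $max_unit) = @_;
    my @hits;
    my $pos = 0;
    while ($pos < length $seq) {
        my ($unit, $total) = ('', 0);
        for my $n (1 .. $max_unit) {
            # Greedy quantifier: take as many copies of this unit as possible.
            next unless substr($seq, $pos) =~ /^((.{$n})\2{$thresh,})/;
            # Replace only when strictly longer, so a tie keeps the
            # shorter unit (CAT x4 rather than CATCAT x2).
            ($unit, $total) = ($2, length $1) if length($1) > $total;
        }
        if ($total) {
            push @hits, [ $unit, $total / length($unit), $pos ];
            $pos += $total;   # Continue after the reported match.
        } else {
            $pos++;
        }
    }
    return @hits;
}

for my $seq (qw(CATCATCATCATCAT AAAGTCAAAGTCAAAGTC)) {
    printf "(%s)%d at pos %d in %s\n", @$_, $seq
        for find_repeats($seq, 1, 6);
}
```

For the two test strings this prints (CAT)5 and (AAAGTC)3, both at pos 0. Raising `$thresh` to something like 9 (or 19 for single characters) would follow the "over about 10 repeats" rule of thumb given above.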