in reply to Re^4: Regexps for microsatellites
in thread Regexps for microsatellites

So two questions for knirirr:

1) Should CATCATCATCATCAT give (a) CATx5 or (b) CATCATx2?

2) Should AAAGTCAAAGTCAAAGTC give (a) AAAx3 or (b) AAAGTCx3?

This can be rephrased as:
Should we favour the longest match (1a and 2b),
should we favour the longest sequence (1b and 2b), or
should we favour the shortest sequence (1a and 2a)?

Until then, we have:

# Favour shortest sequence.
my $thresh = 1;  # Match *more than* $thresh times.

while (<DATA>) {
    chomp;
    while (/((.{1,6}?)\2{$thresh,})/g) {
        printf(
            "Found %d %-6s (length=%d, total=%2d) at pos %2d in %s\n",
            length($1) / length($2),  # Number of matches.
            $2,                       # Sequence.
            length($2),               # Length of sequence.
            length($1),               # Length of match.
            $-[0],                    # Start position.
            $_                        # String we're searching.
        );
    }
}

__DATA__
CATCATCATCATCAT
AAAGTCAAAGTCAAAGTC

gives:

Found 5 CAT    (length=3, total=15) at pos  0 in CATCATCATCATCAT
Found 3 A      (length=1, total= 3) at pos  0 in AAAGTCAAAGTCAAAGTC
Found 2 GTCAAA (length=6, total=12) at pos  3 in AAAGTCAAAGTCAAAGTC
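For comparison, dropping the `?` after `.{1,6}` makes the motif group greedy, which is one way to favour the longest sequence (1b and 2b) instead. A minimal sketch of the difference; `first_repeat` is a hypothetical helper introduced here for illustration, and only the first repeat in each string is reported:

```perl
use strict;
use warnings;

# Hypothetical helper: return "MOTIFxN" for the first repeat that $re finds
# in $seq, or nothing when there is no repeat. $re must capture the whole
# run in $1 and the motif in $2, as the pattern above does.
sub first_repeat {
    my ($seq, $re) = @_;
    return unless $seq =~ $re;
    return sprintf '%sx%d', $2, length($1) / length($2);
}

my $shortest = qr/((.{1,6}?)\2+)/;   # Non-greedy motif: shortest sequence.
my $longest  = qr/((.{1,6})\2+)/;    # Greedy motif: longest sequence.

for my $seq (qw(CATCATCATCATCAT AAAGTCAAAGTCAAAGTC)) {
    printf "%-18s shortest=%-8s longest=%s\n",
        $seq, first_repeat($seq, $shortest), first_repeat($seq, $longest);
}
```

Neither variant gives 1a together with 2b: the non-greedy form stops at Ax3 in the second string, and the greedy form reports CATCATx2 in the first.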

Re^6: Regexps for microsatellites
by knirirr (Scribe) on Nov 09, 2004 at 10:28 UTC
    The answers are 1a and 2b:
    1) (CAT)5, as (CATCAT) can be broken down into two smaller motifs.
    2) (AAAGTC)3, as the motifs must be continuous.
    Of course, if the thresholds were set to detect mono repeats of such a small size then we would indeed find (AAA)3 and (GTC)3, but I usually don't run thresholds that low as such things are not biologically significant. Generally, if it's over about 10 repeats of the pattern it might be interesting, or about 20 repeats if it's a single character. Sorry for the vagueness on this matter.
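One way to get 1a and 2b together is to try every motif length at each position, keep the one that covers the longest run, and break ties in favour of the shorter motif. The following is only a sketch under those assumptions: `find_repeats` and `is_interesting` are hypothetical names, the minimum repeat count should be at least 2, and the cut-offs in `is_interesting` are just the rough figures quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper: scan $seq left to right and return one
# [motif, count, start] triple per repeat. At each position the motif length
# (1 .. $max_motif) that covers the longest run wins; ties go to the shorter
# motif, so CATCAT is never preferred over CAT.
sub find_repeats {
    my ($seq, $min_repeats, $max_motif) = @_;
    my $extra = $min_repeats - 1;    # Copies required after the first one.
    my @found;
    my $pos = 0;
    while ($pos < length($seq)) {
        my $tail = substr($seq, $pos);
        my ($best_motif, $best_total) = ('', 0);
        for my $len (1 .. $max_motif) {
            next unless $tail =~ /^((.{$len})\2{$extra,})/;
            # Strict ">" keeps the shorter motif when two runs tie.
            ($best_motif, $best_total) = ($2, length($1))
                if length($1) > $best_total;
        }
        if ($best_total) {
            push @found, [$best_motif, $best_total / length($best_motif), $pos];
            $pos += $best_total;     # Resume after the run.
        }
        else {
            $pos++;                  # No repeat starts here.
        }
    }
    return @found;
}

# Rough cut-offs from the reply above: over about 10 repeats, or about 20
# for a single-character motif. The exact numbers are a judgment call.
sub is_interesting {
    my ($motif, $count) = @_;
    return $count > (length($motif) == 1 ? 20 : 10);
}

for my $seq (qw(CATCATCATCATCAT AAAGTCAAAGTCAAAGTC)) {
    printf "(%s)%d at pos %d in %s\n", @$_, $seq
        for find_repeats($seq, 2, 6);
}
```

With a minimum of 2 repeats and motifs of up to 6 characters, this reports (CAT)5 and (AAAGTC)3 for the two test strings. Copying the tail with `substr` at every position is wasteful on long sequences, so treat it as an illustration of the selection rule rather than tuned code.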