So two questions for knirirr:
1) Should CATCATCATCATCAT give (a) CATx5 or (b) CATCATx2?
2) Should AAAGTCAAAGTCAAAGTC give (a) Ax3 or (b) AAAGTCx3?
This can be rephrased as:
Should we favour the longest match (1a and 2b),
should we favour the longest sequence (1b and 2b), or
should we favour the shortest sequence (1a and 2a)?
Until then, we have:
```perl
# Favour shortest sequence.

my $thresh = 1;  # Match *more than* $thresh times.

while (<DATA>) {
   chomp;
   while (/((.{1,6}?)\2{$thresh,})/g) {
      printf(
         "Found %d %-6s (length=%d, total=%2d) at pos %2d in %s\n",
         length($1) / length($2),  # Number of matches.
         $2,                       # Sequence.
         length($2),               # Length of sequence.
         length($1),               # Length of match.
         $-[0],                    # Start position.
         $_                        # String we're searching.
      );
   }
}

__DATA__
CATCATCATCATCAT
AAAGTCAAAGTCAAAGTC
```

gives:

```
Found 5 CAT    (length=3, total=15) at pos  0 in CATCATCATCATCAT
Found 3 A      (length=1, total= 3) at pos  0 in AAAGTCAAAGTCAAAGTC
Found 2 GTCAAA (length=6, total=12) at pos  3 in AAAGTCAAAGTCAAAGTC
```
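For comparison, here is a rough sketch of how the other two options could look. It assumes the same limits as the code above (units of 1 to 6 characters, repeated more than once). The names `longest_sequence`, `longest_match`, `$MAX_UNIT` and `$THRESH` are made up for this illustration. "Longest sequence" only needs the `?` dropped so that `.{1,6}` becomes greedy. "Longest match" tries every unit length at each position and keeps the one that covers the most characters, preferring the shorter unit on a tie.

```perl
use strict;
use warnings;

my $MAX_UNIT = 6;  # Longest repeat unit considered (assumed, as above).
my $THRESH   = 1;  # Unit must repeat *more than* $THRESH times.

# Favour longest sequence: the same regex, but with a greedy .{1,N}.
# Returns a list of [ count, unit, start position ].
sub longest_sequence {
   my ($str) = @_;
   my @found;
   while ($str =~ /((.{1,$MAX_UNIT})\2{$THRESH,})/g) {
      push @found, [ length($1) / length($2), $2, $-[0] ];
   }
   return @found;
}

# Favour longest match: at each position, try every unit length and
# keep the one whose repeats cover the most characters.
sub longest_match {
   my ($str) = @_;
   my @found;
   my $pos = 0;
   while ($pos < length($str)) {
      my ($best_unit, $best_total) = ('', 0);
      for my $len (1 .. $MAX_UNIT) {
         # Anchor at $pos and try a unit of exactly $len characters.
         next unless substr($str, $pos) =~ /^((.{$len})\2{$THRESH,})/;
         # Strict ">" means a shorter unit wins when totals are equal.
         ($best_unit, $best_total) = ($2, length($1))
            if length($1) > $best_total;
      }
      if ($best_total) {
         push @found, [ $best_total / length($best_unit), $best_unit, $pos ];
         $pos += $best_total;  # Continue after the match, as //g would.
      }
      else {
         $pos++;
      }
   }
   return @found;
}

for my $str (qw( CATCATCATCATCAT AAAGTCAAAGTCAAAGTC )) {
   printf "longest sequence: %d x %s at pos %d in %s\n", @$_, $str
      for longest_sequence($str);
   printf "longest match:    %d x %s at pos %d in %s\n", @$_, $str
      for longest_match($str);
}
```

On the two test strings this should report CATCATx2 and AAAGTCx3 for "longest sequence" (1b and 2b), and CATx5 and AAAGTCx3 for "longest match" (1a and 2b).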
In reply to Re^5: Regexps for microsatellites
by ikegami
in thread Regexps for microsatellites
by knirirr