Re: Regexps for microsatellites

Other monks helped with the regex approach (per your request). I've used a similar approach for sequence searches, but I thought you might be interested in a couple of other ideas as well.

A highly optimized program was written to identify perfect simple tandem repeats in the human genome (Abstract on PubMed). The executables are available here.
The regex will find only exact repeats. For example, with a window size of 3 and a repeat threshold of 3, something like GATGATaGATGAT will not be found, because the inserted a splits it into two runs of only two copies each. If you also want to identify imperfect repeats, you could use something like REPuter (standalone versions are available).
Depending on the length of the sequence and the number of searches you will be performing, you may want to consider building an index of the sequence (a la FASTA or BLAST) or using a suffix tree. A dynamic programming approach would likely be far too slow.
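
To make the exact-repeat limitation concrete, here is a minimal sketch of the backreference idea. It is in Python rather than Perl only so that it runs standalone; the pattern itself should carry over to a Perl regex more or less unchanged. The function name find_perfect_repeats and its parameters are my own invention for illustration, not taken from any of the tools mentioned above, and I am assuming a plain A/C/G/T alphabet.

```python
import re

def find_perfect_repeats(seq, unit_len=3, min_repeats=3):
    """Return (start, unit, copies) for each perfect tandem repeat.

    Hypothetical helper for illustration only. Assumes an A/C/G/T
    alphabet; the sequence is uppercased first so that case (such as
    the lowercase 'a' marking the insertion) plays no role in matching.
    """
    seq = seq.upper()
    # Capture one unit, then require the same unit at least
    # (min_repeats - 1) more times via the backreference \1.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_repeats - 1))
    hits = []
    for m in pattern.finditer(seq):
        copies = len(m.group(0)) // unit_len
        hits.append((m.start(), m.group(1), copies))
    return hits

# A perfect repeat is reported with its start, unit and copy number.
print(find_perfect_repeats("ccGATGATGATcc"))   # [(2, 'GAT', 3)]

# The interrupted repeat is missed: neither side reaches three copies.
print(find_perfect_repeats("GATGATaGATGAT"))   # []
```

Lowering min_repeats to 2 would report the two halves separately, which is about as far as a plain regex gets you; joining them across the interruption is what the imperfect-repeat tools are for.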

That said, the regex method will certainly work, and it may be just as fast and easier to implement and maintain. I just wanted to provide a couple of other options.

HTH
