I'm rewriting some code that searches for microsatellites in a genome. Essentially, this code is a lot of scaffolding around a regular expression that searches for repetitive patterns of G, A, T or C in a long string (thousands of characters) containing only those letters. It is supposed to find motifs of 1-6 letters repeated N or more times, e.g. (A)N, (AT)N, (ATT)N, (ATTC)N, (ATTCC)N or (ATTCCG)N. One way I've done this is to search through the string six times with:

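my $threshold = 5;    # stand-in for N (see description above); any positive integer
for my $i (1 .. 6) {
    my $pattern = "." x $i;    # a motif of $i arbitrary letters
    # Note: $1 is itself one copy, so this needs $threshold + 1 copies in total.
    while ($sequence_string =~ /($pattern)\1{$threshold,}/g) {
        # ....
    }
}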

When code is added to make sure that (ATT)12 is not also detected as (ATT)6 etc., this works quite well, and it also gives me length($1), which is important. But because it reads through the sequence six times, it is quite slow. I'd like to replace it with a regexp that reads through the same string only once and finds the same things. I have something like this:

while ($sequence_string =~ /(.)\1{$threshold,}|(.{2,3}?)\2{$threshold,}|(.{4,6}?)\3{$threshold,}/g)

I'm not really satisfied with this, though. Can anyone suggest a cleaner way to do it? It would be handy if I could step from position 1 to the total genome length and, at each position, try to find every motif that repeats at least the threshold number of times. If none is found, move forward to the next character; if one is, move forward to the character after the end of the pattern found. Perhaps something like:

$i = 1;
while ($i < $genome_length)
{
   for ($j=1;$j<=6;$j++)
   {
      # run regexp looking for motifs of length $j
      # starting at position $i
   }
   # if a motif was found: $i = (end of that match) + 1
   # otherwise:            $i = $i + 1
}
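
To make the idea above concrete, here is a minimal sketch of that scan. It is only one possible way to do it, under some assumptions: the sub name find_microsatellites and the hash keys start, motif and copies are made up for illustration, offsets are 0-based, and $min_copies counts the total number of copies (so the backreference has to match $min_copies - 1 further times). It relies on pos() and the \G assertion to pin each attempt to the current offset, and tries the shortest motif first so that a run of ATT is reported with the 3-letter motif and not as ATTATT.

```perl
use strict;
use warnings;

# Hypothetical helper: walk $seq left to right and report each run of at
# least $min_copies copies of a 1-6 letter motif.
sub find_microsatellites {
    my ($seq, $min_copies) = @_;
    my $extra = $min_copies - 1;    # the captured motif already counts as one copy
    my @hits;
    my $i = 0;                      # 0-based offset into $seq
    POSITION: while ($i < length $seq) {
        for my $j (1 .. 6) {        # shortest motif first
            pos($seq) = $i;         # \G anchors the attempt at this offset
            if ($seq =~ /\G(.{$j})\1{$extra,}/g) {
                my $end = pos($seq);    # offset just past the whole repeat
                push @hits, { start => $i, motif => $1, copies => ($end - $i) / $j };
                $i = $end;          # resume at the character after the repeat
                next POSITION;
            }
        }
        $i++;                       # no repeat starts here: advance one character
    }
    return @hits;
}

# Example call on a made-up sequence.
for my $hit (find_microsatellites("GGATATATATCCCCCCG", 4)) {
    print "($hit->{motif})$hit->{copies} at offset $hit->{start}\n";
}
```

On the made-up sequence above this reports (AT)4 at offset 2 and (C)6 at offset 10. Each failed position still costs up to six match attempts, so whether this is actually faster than six separate passes on a real genome is something I would benchmark and not assume.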

Any suggestions would be welcome. Thanks!

Janitored by Arunbear - replaced pre tags with code tags, as per Monastery guidelines

In reply to Regexps for microsatellites by knirirr
