knirirr has asked for the wisdom of the Perl Monks concerning the following question:

I'm re-writing some code that searches for microsatellites in a genome. Essentially, this code is a lot of stuff surrounding a regular expression that searches for particular repetitive patterns of G,A,T or C in a long (1000s of characters) string containing those letters only. It is supposed to find patterns of 1-6 letters repeating N or more times, e.g. (A)N, (AT)N, (ATT)N, (ATTC)N, (ATTCC)N or (ATTCCG)N. One way I've done this is to search through the string 6 times with:
$threshold = "N" # see description above for($i=1;$i<=6;$i++) { my $pattern = "." x $i; while ($sequence_string =~ /($pattern)\1{$threshold,}/g) { .... } }
When code is added to make sure that (ATT)12 is not also detected as (ATT)6 &c. this works quite well, and also provides me with length($1), which is important. But, as it reads through the sequence 6 times it is quite slow. I'd like to replace it with a regexp that reads through the same string only once and finds the same things. I have something like this: while ($sequence_string =~ /(.)\1{$threshold,}|(.{2,3}?)\2{$threshold,}|(.{4,6}?)\3{$threshold,}/g) I'm not really satisfied with this, though. Can anyone suggest a cleaner way to do it? It would be handy if I could iterate from 1 to total genome length and attempt to find all possible combinations above the threshold length at that position. If none are found, move forward to the next charater, but if one is, move forward to the character after the end of the pattern found. Perhaps something like:
$i = 1; while ($i < $genome_length) { for ($j=1;$j<=6;$j++) { # run regexp looking for motifs of length $j # starting at position $i } $i = (end of previous match + 1) }
Any suggestions would be welcome. Thanks!

Janitored by Arunbear - replaced pre tags with code tags, as per Monastery guidelines

Replies are listed 'Best First'.
Re: Regexps for microsatellites
by Roy Johnson (Monsignor) on Nov 08, 2004 at 15:14 UTC
    Does this do what you want?
    my $thresh = 3; # as per Ikegami's note, pattern must repeat MORE THAN + this many times while (<DATA>) { while (/((.{1,6})\2{$thresh,})/g) { printf "$2 (length=%d) repeats in $1\n", length $2; printf "Found at $-[0] to %d in $_", pos()-1; } } __DATA__ GATTATTATTATTATTATTGCATATATATAGCAAAAAATTTTTTGC ATATTATATTATATTATATTGC ATAGACATAGACATAGAC

    Caution: Contents may have been coded under pressure.

      The threshold is off by one. Your regexp matches "more than N" rather than "N or more". Subtract one from $threshold before building your regexp.

Re: Regexps for microsatellites
by erix (Prior) on Nov 08, 2004 at 14:40 UTC

    Does not bioperl have this?

    You can search the bioperl site (and other bioinformatics sites) with this page: open-bio.org.

Re: Regexps for microsatellites
by bobf (Monsignor) on Nov 08, 2004 at 21:54 UTC

    Other monks helped with the regex approach (per your request). I've used a similar approach for sequence searches, but I thought you might be interested in a couple of other ideas as well.

    • A highly optimized program was written to identify perfect simple tandem repeats in the human genome (Abstract on PubMed). The executables are available here.
    • The regex will find only exact repeats. That means something like GATGATaGATGAT will not be found (with a window size of 3 and repeat threshold of 3). If you wanted to also identify imperfect repeats, you could use something like REPuter (standalone versions are available).
    • Depending on the length of the sequence and the number of searches you will be performing, you may want to consider creating an index of the sequence (a la FASTA or BLAST) or use a suffix tree approach. A dynamic programming approach would likely be far too slow.

    That said, the regex method will certainly work, and it may be just as fast and easier to implement and maintain. I just wanted to provide a couple of other options.

    HTH

Re: Regexps for microsatellites
by tachyon (Chancellor) on Nov 09, 2004 at 01:40 UTC

    I would suggest something like this as it only requires a single pass, uses index which is faster than an RE and will accomodate non adjacent matching.

    my $dna = 'CATCATCAT_____CATCATCATCAT____CAT_CAT___CAT__CAT___'; my $find = 'CAT'; my $fudge = 1; # the fudge factor allow detection of sequences that are close # but not immediately adjacent to one another. # set to 0 no separation is allowed, # set to 1 there can be 0-1 bases between the find pattern etc. my $index = 0; my $last_index = -999999; my $start_index = 0; my $num = 0; my $cur_offset = 0; my $len = length $find; my $hash; while ( ($index = index($dna, $find,$cur_offset)) != -1 ) { if ( $index <= ($last_index+$len+$fudge) ) { $num++; } else { if ( $num ) { print "$num\n"; push @{$hash->{$num}}, [$start_index, $last_index+$len-1]; } print "Found at $index, repeats "; $start_index = $index; $num = 1; } $cur_offset = $index+$len; $last_index = $index; } # get the last match, if it exists if ( $num ) { print "$num\n"; push @{$hash->{$num}}, [$start_index, $last_index+$len-1]; } for $num( sort { $a<=>$b } keys %$hash ) { printf "%d repeat\n\t%d found\n", $num, scalar(@{$hash->{$num}}); printf "\t\tOffset %d - %d (%d)\n", @$_, ($_->[1]-$_->[0]+1) for @ +{$hash->{$num}}; } __DATA__ Found at 0, repeats 3 Found at 14, repeats 4 Found at 30, repeats 2 Found at 40, repeats 1 Found at 45, repeats 1 1 repeat 2 found Offset 40 - 42 (3) Offset 45 - 47 (3) 2 repeat 1 found Offset 30 - 36 (7) 3 repeat 1 found Offset 0 - 8 (9) 4 repeat 1 found Offset 14 - 25 (12)

    cheers

    tachyon

Re: Regexps for microsatellites
by ikegami (Patriarch) on Nov 08, 2004 at 16:10 UTC

    The problem is incompletely defined. What do you want if you encounter "CATCATCATCATCATCATCAT"? Do you want the longest match (7 "CAT"s for a total length 21) or do you want to match using the longuest repeating sequence (3 "CATCAT"s for a total length 18)? You only mentioned the case where the matches had the same length.

      The fact that the OP looks for patterns in order of increasing length suggests that the pattern should not include a repeat.

      Caution: Contents may have been coded under pressure.

        If that's so, then your solution doesn't work. It can easily be fixed by substituting (.{1,6}?) for the existing (.{1,6}).

        Update: Nope, adding '?' is no good, cause it'll think AAAGTCAAAGTC is Ax3 instead of AAAGTCx2.

      In this case it's definitely the (CAT)7, as if the pattern can be broken down into identical parts then it is treated as a smaller microsatellite. I have some code to do that, e.g.
      my $v1 = substr($match, 0, 1); my $v2 = substr($match, 1, 1); next if ($match =~ /^($v1$v2){2,}$/);