Re: Regexps for microsatellites

I would suggest something like this: it needs only a single pass, uses index (which is typically faster than an RE for a fixed string), and will accommodate non-adjacent matches.

my $dna = 'CATCATCAT_____CATCATCATCAT____CAT_CAT___CAT__CAT___';

my $find  = 'CAT';
my $fudge = 1;

# The fudge factor allows detection of sequences that are close
# but not immediately adjacent to one another.
# Set to 0, no separation is allowed;
# set to 1, there can be 0-1 bases between copies of the find pattern, etc.

my $index       = 0;
my $last_index  = -999999;
my $start_index = 0;
my $num         = 0;
my $cur_offset  = 0;
my $len         = length $find;
my $hash;

while ( ($index = index($dna, $find,$cur_offset)) != -1 ) {
    if ( $index <= ($last_index+$len+$fudge) ) {
        $num++;
    }
    else {
        if ( $num ) {
            print "$num\n";
            push @{$hash->{$num}}, [$start_index, $last_index+$len-1];
        }
        print "Found at $index, repeats "; 
        $start_index = $index;
        $num = 1;      
    }    
    $cur_offset = $index+$len;
    $last_index = $index;
}

# get the last match, if it exists
if ( $num ) {
    print "$num\n";
    push @{$hash->{$num}}, [$start_index, $last_index+$len-1];
}

for $num( sort { $a<=>$b } keys %$hash ) {
    printf "%d repeat\n\t%d found\n", 
        $num, scalar(@{$hash->{$num}});
    printf "\t\tOffset %d - %d (%d)\n", @$_, ($_->[1]-$_->[0]+1)
        for @{$hash->{$num}};
}

__DATA__
Found at 0, repeats 3
Found at 14, repeats 4
Found at 30, repeats 2
Found at 40, repeats 1
Found at 45, repeats 1

1 repeat
    2 found
        Offset 40 - 42 (3)
        Offset 45 - 47 (3)
2 repeat
    1 found
        Offset 30 - 36 (7)
3 repeat
    1 found
        Offset 0 - 8 (9)
4 repeat
    1 found
        Offset 14 - 25 (12)
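
To see what the fudge factor does, here is a minimal sketch of the same index scan wrapped in a subroutine so it can be rerun with different settings. The name find_runs and the [start, end, count] return shape are made up for illustration and are not part of the code above; treat it as one possible arrangement rather than the definitive one.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original post): returns a list of
# [start, end, count] triples, one per run of $find in $dna.
sub find_runs {
    my ($dna, $find, $fudge) = @_;
    my $len = length $find;
    return unless $len;    # an empty pattern would never advance the offset

    my @runs;
    my ($start, $last, $count);
    my $offset = 0;

    while ( (my $index = index($dna, $find, $offset)) != -1 ) {
        if ( defined $last and $index <= $last + $len + $fudge ) {
            $count++;    # close enough to extend the current run
        }
        else {
            # close the previous run, if there was one, and start a new one
            push @runs, [$start, $last + $len - 1, $count] if defined $last;
            ($start, $count) = ($index, 1);
        }
        $last   = $index;
        $offset = $index + $len;
    }

    # record the final run, if anything matched at all
    push @runs, [$start, $last + $len - 1, $count] if defined $last;
    return @runs;
}

my $dna = 'CATCATCAT_____CATCATCATCAT____CAT_CAT___CAT__CAT___';
for my $fudge (0 .. 2) {
    my @runs = find_runs($dna, 'CAT', $fudge);
    printf "fudge %d: %s\n", $fudge,
        join ' ', map { "$_->[0]-$_->[1]x$_->[2]" } @runs;
}
```

With a fudge of 0 the two CATs at 30 and 34 are reported as separate single repeats; with 1 they merge into one run of 2 (offsets 30 - 36, as in the output above); with 2 the CATs at 40 and 45 merge as well.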

cheers

tachyon
