Speed up DNA dotplot

Microcebus has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have two DNA sequences and want to produce a dotplot using a sliding window of a given size, allowing a given number of mismatches per window. The following script produces a text matrix in which "1" stands for a hit.

The problem is that this code runs very slowly on long DNA sequences (more than 10,000 nucleotides). I already use my four CPU cores for the calculation, but I have no idea how to speed it up further. Any suggestions are welcome!

### CREATE TWO SAMPLE DNA SEQUENCES ###
@nucleotides=('A','T','G','C');
foreach(1..1000)
    {
    $seq1.=$nucleotides[int(rand(3))];
    }
foreach(1..1000)
    {
    $seq2.=$nucleotides[int(rand(3))];
    }


### SETTINGS FOR THE DOTPLOT ###
$window_size=5;
$max_mismatch=1;


@seq1=split('',$seq1);
foreach(1..$window_size)
    {
    shift@seq1;
    }
@seq2=split('',$seq2);
foreach(1..$window_size)
    {
    shift@seq2;
    }

open(OUT,">ID_matrix.txt");
$number_of_windows_1=((length$seq1)-$window_size)+1;
$number_of_windows_2=((length$seq2)-$window_size)+1;

$time_start=time;
foreach$window_no(0..$number_of_windows_1-1)
    {
    @seq2_temp=@seq2;
    if($window_no==0)
        {
        $current_window=substr($seq1,0,$window_size);
        @current_window=split('',$current_window);
        }
    else
        {
        shift@current_window;
        $next_character=shift@seq1;
        push(@current_window,$next_character);
        }
    foreach$query_no(0..$number_of_windows_2-1)
        {
        if($query_no==0)
            {
            $query_window=substr($seq2,0,$window_size);
            @query_window=split('',$query_window);
            }
        else
            {
            shift@query_window;
            $next_character=shift@seq2_temp;
            push(@query_window,$next_character);
            }
        $count_matches=0;
        foreach(0..$window_size-1)
            {
            if($current_window[$_]eq$query_window[$_])
                {
                $count_matches++;
                last if($_-$count_matches>$max_mismatch);
                }
            }
        if($count_matches>=$window_size-$max_mismatch)
            {
            print OUT "1";
            }
        else
            {
            print OUT "0";
            }
        }
    print OUT"\n";
    }
$time_end=time;
$time_used=$time_end-$time_start;
close OUT;
print"Time used: $time_used seconds.\n";
system("pause");
exit;

Re: Speed up DNA dotplot by jwkrahn (Abbot) on Jul 14, 2011 at 06:35 UTC
    @nucleotides=('A','T','G','C');
    foreach(1..1000)
        {
        $seq1.=$nucleotides[int(rand(3))];
        }
    foreach(1..1000)
        {
        $seq2.=$nucleotides[int(rand(3))];
        }

You are only using the first three elements of the array `@nucleotides`. The proper way to do that is:

    my @nucleotides = qw( A T G C );
    foreach ( 1 .. 1_000 ) {
        $seq1 .= $nucleotides[ rand @nucleotides ];
        $seq2 .= $nucleotides[ rand @nucleotides ];
    }

    @seq1=split('',$seq1);
    foreach(1..$window_size)
        {
        shift@seq1;
        }
    @seq2=split('',$seq2);
    foreach(1..$window_size)
        {
        shift@seq2;
        }

The usual way to do that is:

    my @seq1 = split //, $seq1;
    splice @seq1, 0, $window_size;
    my @seq2 = split //, $seq2;
    splice @seq2, 0, $window_size;

    open(OUT,">ID_matrix.txt");

You should always verify that open worked correctly:

    open OUT, '>', 'ID_matrix.txt' or die "Cannot open 'ID_matrix.txt' because: $!";
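The point about `rand(3)` can be checked empirically. The following is a minimal sketch, not part of the original reply; the helper `tally` is a made-up name used only for illustration. It counts how often each base is drawn with `int rand 3` and with `int rand @nucleotides`.

```perl
use strict;
use warnings;

my @nucleotides = qw( A T G C );

# Hypothetical helper: draw $n bases using the index generator $pick
# and return a hash reference of counts per base.
sub tally {
    my ($n, $pick) = @_;
    my %count;
    $count{ $nucleotides[ $pick->() ] }++ for 1 .. $n;
    return \%count;
}

# int rand 3 only yields the indices 0, 1 and 2, so 'C' is never drawn.
my $biased = tally( 10_000, sub { int rand 3 } );

# rand @nucleotides uses the array length (4), so all indices can occur.
my $fair = tally( 10_000, sub { int rand @nucleotides } );

for my $base (@nucleotides) {
    printf "%s  rand(3): %5d   rand(\@nucleotides): %5d\n",
        $base, $biased->{$base} // 0, $fair->{$base} // 0;
}
```

With the first generator the count for `C` stays at zero however many bases are drawn, which is why the sample sequences in the question contain only A, T and G.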
Re: Speed up DNA dotplot by happy.barney (Friar) on Jul 14, 2011 at 07:57 UTC
I'm not sure if I really understood your requirements or code.

    my $MAX = 10_000;
    my $WINDOW_SIZE = 5;
    my $MAX_MISMATCH = 1;

    my $seq1 = join '', qw(A T G C)[ map int rand 4, 1 .. $MAX ];
    my $seq2 = join '', qw(A T G C)[ map int rand 4, 1 .. $MAX ];

    sub with_regexp {
        my ($seq1, $seq2, $window, $mismatch) = @_;
        my $retval = '';

        for my $start (0 .. length ($seq1) - $window - 1) {
            my $regex = build_regexp (substr ($seq1, $start, $window), $mismatch);
            pos $seq2 = 0;
            do {
                $retval .= $seq2 =~ m/\G(?=$regex)/gc ? 1 : 0
            } while $seq2 =~ m/\G(?=.{$window})./g;
            $retval .= "\n";
        }

        $retval;
    }

    sub build_parts {
        my ($window, $mismatch) = @_;
        my $l = length $window;
        $mismatch = $l if $mismatch > $l;

        return $window unless $mismatch;
        return '.' x $l if $l == $mismatch;

        my ($first, $rest) = split //, $window, 2;
        return (
            (map $first . $_, build_parts ($rest, $mismatch)),
            (map '.' . $_, build_parts ($rest, $mismatch - 1)),
        );
    }

    sub build_regexp {
        join '|', map '(?:' . $_ . ')', build_parts (@_);
    }

    print with_regexp ($seq1, $seq2, $WINDOW_SIZE, $MAX_MISMATCH);
Re^2: Speed up DNA dotplot by happy.barney (Friar) on Jul 14, 2011 at 09:35 UTC
Small improvements:

    sub test_regexps2 {
        my ($seq1, $seq2, $window, $mismatch) = @_;
        my $retval = '';
        my %cache;
        my @mask = (0) x (length ($seq2) - $window);

        for my $start (0 .. (length ($seq1) - $window)) {
            my $part = substr ($seq1, $start, $window);
            $retval .= $cache{$part} ||= do {
                my $regex = build_regexp ($part, $mismatch);
                my @res = @mask;
                while ($seq2 =~ m/(?=$regex)/g) {
                    $res[ pos $seq2 ] = 1;
                }
                join '', @res, "\n";
            };
        }

        $retval;
    }

Benchmarks for sequence lengths 200, 400 and 600:

    200:
                    Rate   orig_poster  test_regexps test_regexps2
    orig_poster   9.70/s            --          -51%          -81%
    test_regexps  19.6/s          103%            --          -61%
    test_regexps2 49.8/s          413%          153%            --

    400:
                    Rate   orig_poster  test_regexps test_regexps2
    orig_poster   2.44/s            --          -52%          -84%
    test_regexps  5.08/s          109%            --          -67%
    test_regexps2 15.5/s          535%          205%            --

    600:
                    Rate   orig_poster  test_regexps test_regexps2
    orig_poster   1.06/s            --          -54%          -86%
    test_regexps  2.30/s          117%            --          -70%
    test_regexps2 7.75/s          633%          237%            --
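The gain from `%cache` comes from the fact that identical windows in the first sequence always produce identical rows, and with a window of 5 over four bases there are at most 4**5 = 1024 distinct windows. The following is a minimal sketch of that caching idea on its own, assuming a plain string comparison instead of the regex builder. The names `dotplot_row` and `dotplot_cached` are hypothetical, and the mismatch count uses string XOR, which is only one possible choice.

```perl
use strict;
use warnings;

# Hypothetical helper: build one row of 0/1 flags for a single window
# of the first sequence compared against every window of $seq2.
sub dotplot_row {
    my ($win, $seq2, $size, $max_miss) = @_;
    my $row = '';
    for my $off ( 0 .. length($seq2) - $size ) {
        my $xor    = $win ^ substr( $seq2, $off, $size );
        my $misses = $xor =~ tr/\0//c;    # count bytes that differ
        $row .= $misses > $max_miss ? 0 : 1;
    }
    return $row;
}

# Hypothetical driver: compute each distinct window's row only once
# and reuse it whenever the same window text occurs again in $seq1.
sub dotplot_cached {
    my ($seq1, $seq2, $size, $max_miss) = @_;
    my ( %cache, @rows );
    for my $off ( 0 .. length($seq1) - $size ) {
        my $win = substr $seq1, $off, $size;
        $cache{$win} = dotplot_row( $win, $seq2, $size, $max_miss )
            unless exists $cache{$win};
        push @rows, $cache{$win};
    }
    return \@rows;
}

# Small example: the window 'ATGC' occurs twice in the first sequence,
# so its row is computed once and emitted twice.
print "$_\n" for @{ dotplot_cached( 'ATGCATGC', 'ATGCA', 4, 0 ) };
```

The cache pays off most when the window is short relative to the sequence length. For long windows almost every window is unique and the hash only adds overhead.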
Re: Speed up DNA dotplot (65% speedup) by BrowserUk (Patriarch) on Jul 14, 2011 at 14:32 UTC
Try this. It produces identical results to your posted code in less than half the time:

    #! perl -s
    use strict;
    use Math::Random::MT qw[ rand srand ];
    use Time::HiRes qw[ time ];

    srand 1;

    ### CREATE TWO SAMPLE DNA SEQUENCES ###
    my @nucleotides = ('A','T','G','C');
    my $seq1 = join '', map $nucleotides[ rand 4 ], 1 .. 1000;
    my $seq2 = join '', map $nucleotides[ rand 4 ], 1 .. 1000;

    ### SETTINGS FOR THE DOTPLOT ###
    my $nWindow = 5;
    my $maxMisses = 1;

    open OUT, ">ID_matrix.txt" or die $!;
    my $nWindow1 = ( ( length $seq1 ) - $nWindow );
    my $nWindow2 = ( ( length $seq2 ) - $nWindow );

    my $time_start = time;
    for my $off1 ( 0 .. $nWindow1 ) {
        my $sub1 = substr $seq1, $off1, $nWindow;
        for my $off2 ( 0 .. $nWindow2 ) {
            my $sub2 = substr $seq2, $off2, $nWindow;
            my $misses = $nWindow - ( ( $sub1 ^ $sub2 ) =~ tr[\0][\0] );
            print OUT $misses > $maxMisses ? 0 : 1;
        }
        print OUT "\n";
    }
    my $time_end = time;
    my $time_used = $time_end - $time_start;
    close OUT;
    print"Time used: $time_used seconds.\n";

    __END__
    c:\test>junk9      ## original
    Time used: 4.12800002098084 seconds.
    Press any key to continue . . .

    c:\test>914283     ## This code
    Time used: 1.87000012397766 seconds.

    c:\test>dir ID*
    14/07/2011  12:19           994,008 ID_matrix.ref
    14/07/2011  15:24           994,008 ID_matrix.txt

    c:\test>diff ID_matrix.ref ID_matrix.txt     ## sanity check
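The key line in the code above is the string XOR. Applied to two strings, Perl's `^` operator works byte by byte and yields a NUL byte wherever the two strings hold the same character, so counting NUL bytes with `tr` counts the matching positions in a single operation instead of a per-character loop. A minimal sketch of the idea in isolation; the helper name `count_mismatches` is made up for illustration, and it assumes both strings have the same length.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): number of positions at
# which two equal-length strings differ.
sub count_mismatches {
    my ($x, $y) = @_;
    my $xor = $x ^ $y;           # "\0" wherever the bytes are equal
    return $xor =~ tr/\0//c;     # count the non-NUL (differing) bytes
}

print count_mismatches( 'ATGCA', 'ATGCA' ), "\n";   # identical windows: 0
print count_mismatches( 'ATGCA', 'ATGGA' ), "\n";   # one substitution: 1
print count_mismatches( 'ATGCA', 'TACGT' ), "\n";   # no position agrees: 5
```

A window pair is then a hit when the returned count does not exceed the allowed number of mismatches, which is the same test the loop above performs with `$misses > $maxMisses`.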