Microcebus has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have two DNA sequences and want to produce a dotplot using a sliding window of a given size, allowing a given number of mismatches per window. The following script writes a text matrix in which "1" stands for a hit.

The problem is that this code is very slow for long DNA sequences (>10000). I already use my four CPU cores for the calculation, but I have no idea how to speed it up further. Any suggestions are welcome!

### CREATE TWO SAMPLE DNA SEQUENCES ###
@nucleotides = ('A','T','G','C');
foreach (1..1000) {
    $seq1 .= $nucleotides[int(rand(3))];
}
foreach (1..1000) {
    $seq2 .= $nucleotides[int(rand(3))];
}

### SETTINGS FOR THE DOTPLOT ###
$window_size  = 5;
$max_mismatch = 1;

@seq1 = split('', $seq1);
foreach (1..$window_size) {
    shift @seq1;
}
@seq2 = split('', $seq2);
foreach (1..$window_size) {
    shift @seq2;
}

open(OUT, ">ID_matrix.txt");
$number_of_windows_1 = ((length $seq1) - $window_size) + 1;
$number_of_windows_2 = ((length $seq2) - $window_size) + 1;

$time_start = time;
foreach $window_no (0..$number_of_windows_1-1) {
    @seq2_temp = @seq2;
    if ($window_no == 0) {
        $current_window = substr($seq1, 0, $window_size);
        @current_window = split('', $current_window);
    }
    else {
        shift @current_window;
        $next_character = shift @seq1;
        push(@current_window, $next_character);
    }
    foreach $query_no (0..$number_of_windows_2-1) {
        if ($query_no == 0) {
            $query_window = substr($seq2, 0, $window_size);
            @query_window = split('', $query_window);
        }
        else {
            shift @query_window;
            $next_character = shift @seq2_temp;
            push(@query_window, $next_character);
        }
        $count_matches = 0;
        foreach (0..$window_size-1) {
            if ($current_window[$_] eq $query_window[$_]) {
                $count_matches++;
                last if ($_ - $count_matches > $max_mismatch);
            }
        }
        if ($count_matches >= $window_size - $max_mismatch) {
            print OUT "1";
        }
        else {
            print OUT "0";
        }
    }
    print OUT "\n";
}
$time_end  = time;
$time_used = $time_end - $time_start;
close OUT;
print "Time used: $time_used seconds.\n";
system("pause");
exit;

Replies are listed 'Best First'.
Re: Speed up DNA dotplot
by jwkrahn (Abbot) on Jul 14, 2011 at 06:35 UTC
    @nucleotides=('A','T','G','C');
    foreach(1..1000) {
        $seq1.=$nucleotides[int(rand(3))];
    }
    foreach(1..1000) {
        $seq2.=$nucleotides[int(rand(3))];
    }

    You are only using the first three elements of the array @nucleotides: int(rand(3)) can only return 0, 1 or 2, so 'C' is never chosen. The proper way to pick from the whole array is:

    my @nucleotides = qw( A T G C );
    foreach ( 1 .. 1_000 ) {
        $seq1 .= $nucleotides[ rand @nucleotides ];
        $seq2 .= $nucleotides[ rand @nucleotides ];
    }
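A quick way to see the difference is to draw many indices both ways and list which ones turn up. This is only a minimal sketch: the helper name `indices_seen` and the draw count of 10_000 are made up for illustration and are not part of the posted code.

```perl
use strict;
use warnings;

# Hypothetical helper: call $draw->() $n times and return the sorted
# list of distinct values it produced.
sub indices_seen {
    my ($draw, $n) = @_;
    my %seen;
    $seen{ $draw->() }++ for 1 .. $n;
    return sort { $a <=> $b } keys %seen;
}

my @nucleotides = qw( A T G C );

# rand(3) returns a value in [0, 3), so int() can only give 0, 1 or 2.
my @old = indices_seen( sub { int rand 3 }, 10_000 );

# rand @nucleotides sees the array in scalar context (its length, 4).
# An array subscript would truncate the fraction by itself; int() is
# explicit here only because the value is used as a hash key.
my @new = indices_seen( sub { int rand @nucleotides }, 10_000 );

print "int(rand(3))      -> @old\n";
print "rand \@nucleotides -> @new\n";
```

Index 3 (the 'C') can never appear in the first line, whatever the number of draws.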


    @seq1=split('',$seq1);
    foreach(1..$window_size) {
        shift@seq1;
    }
    @seq2=split('',$seq2);
    foreach(1..$window_size) {
        shift@seq2;
    }

    The usual way to do that is:

    my @seq1 = split //, $seq1;
    splice @seq1, 0, $window_size;
    my @seq2 = split //, $seq2;
    splice @seq2, 0, $window_size;


    open(OUT,">ID_matrix.txt");

    You should always verify that open worked correctly:

    open OUT, '>', 'ID_matrix.txt' or die "Cannot open 'ID_matrix.txt' because: $!";
Re: Speed up DNA dotplot
by happy.barney (Friar) on Jul 14, 2011 at 07:57 UTC
    I'm not sure if I really understood your requirements or code.
    my $MAX          = 10_000;
    my $WINDOW_SIZE  = 5;
    my $MAX_MISMATCH = 1;

    my $seq1 = join '', qw(A T G C)[ map int rand 4, 1 .. $MAX ];
    my $seq2 = join '', qw(A T G C)[ map int rand 4, 1 .. $MAX ];

    sub with_regexp {
        my ($seq1, $seq2, $window, $mismatch) = @_;
        my $retval = '';
        for my $start (0 .. length ($seq1) - $window - 1) {
            my $regex = build_regexp (substr ($seq1, $start, $window), $mismatch);
            pos $seq2 = 0;
            do {
                $retval .= $seq2 =~ m/\G(?=$regex)/gc ? 1 : 0
            } while $seq2 =~ m/\G(?=.{$window})./g;
            $retval .= "\n";
        }
        $retval;
    }

    # Return every pattern that matches $window with up to $mismatch
    # positions replaced by '.'
    sub build_parts {
        my ($window, $mismatch) = @_;
        my $l = length $window;
        $mismatch = $l if $mismatch > $l;
        return $window unless $mismatch;
        return '.' x $l if $l == $mismatch;
        my ($first, $rest) = split //, $window, 2;
        return (
            (map $first . $_, build_parts ($rest, $mismatch)),
            (map '.' . $_, build_parts ($rest, $mismatch -1)),
        );
    }

    sub build_regexp {
        join '|', map '(?:' . $_ . ')', build_parts (@_);
    }

    # Call fixed to match the sub and variable names defined above
    print with_regexp ($seq1, $seq2, $WINDOW_SIZE, $MAX_MISMATCH);
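To make the regexp construction concrete, here is a small sketch that copies `build_parts` and `build_regexp` from the post and prints what they generate for a three-letter window with one mismatch allowed. The sample window 'ATG' and the candidate strings are made up for illustration.

```perl
use strict;
use warnings;

# Copied from the post above, logic unchanged.
sub build_parts {
    my ($window, $mismatch) = @_;
    my $l = length $window;
    $mismatch = $l if $mismatch > $l;
    return $window  unless $mismatch;
    return '.' x $l if $l == $mismatch;
    my ($first, $rest) = split //, $window, 2;
    return (
        (map $first . $_, build_parts($rest, $mismatch)),
        (map '.' . $_,    build_parts($rest, $mismatch - 1)),
    );
}

sub build_regexp {
    return join '|', map '(?:' . $_ . ')', build_parts(@_);
}

# One alternative per position that is allowed to differ.
my $regex = build_regexp('ATG', 1);
print "$regex\n";    # prints (?:AT.)|(?:A.G)|(?:.TG)

# Illustrative candidates: the first three differ from ATG in at most
# one position, the last one differs in two.
for my $candidate (qw( ATG ATC CTG AAA )) {
    printf "%s %s\n", $candidate,
        $candidate =~ /^(?:$regex)$/ ? 'hit' : 'miss';
}
```

The number of alternatives grows with the window size and the mismatch count, so this approach is probably best suited to small windows like the one in the question.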
      small improvements:
      sub test_regexps2 {
          my ($seq1, $seq2, $window, $mismatch) = @_;
          my $retval = '';
          my %cache;
          my @mask = (0) x (length ($seq2) - $window);
          for my $start (0 .. (length ($seq1) - $window)) {
              my $part = substr ($seq1, $start, $window);
              $retval .= $cache{$part} ||= do {
                  my $regex = build_regexp ($part, $mismatch);
                  my @res = @mask;
                  while ($seq2 =~ m/(?=$regex)/g) {
                      $res[ pos $seq2 ] = 1;
                  }
                  join '', @res, "\n";
              };
          }
          $retval;
      }
      benchmarks for length 200, 400 and 600
      200:
                      Rate orig_poster test_regexps test_regexps2
      orig_poster   9.70/s          --         -51%          -81%
      test_regexps  19.6/s        103%           --          -61%
      test_regexps2 49.8/s        413%         153%            --

      400:
                      Rate orig_poster test_regexps test_regexps2
      orig_poster   2.44/s          --         -52%          -84%
      test_regexps  5.08/s        109%           --          -67%
      test_regexps2 15.5/s        535%         205%            --

      600:
                      Rate orig_poster test_regexps test_regexps2
      orig_poster   1.06/s          --         -54%          -86%
      test_regexps  2.30/s        117%           --          -70%
      test_regexps2 7.75/s        633%         237%            --
Re: Speed up DNA dotplot (65% speedup)
by BrowserUk (Patriarch) on Jul 14, 2011 at 14:32 UTC

    Try this. It produces identical results to your posted code in less than half the time:

    #! perl -s
    use strict;
    use Math::Random::MT qw[ rand srand ];
    use Time::HiRes qw[ time ];

    srand 1;

    ### CREATE TWO SAMPLE DNA SEQUENCES ###
    my @nucleotides = ('A','T','G','C');
    my $seq1 = join '', map $nucleotides[ rand 4 ], 1 .. 1000;
    my $seq2 = join '', map $nucleotides[ rand 4 ], 1 .. 1000;

    ### SETTINGS FOR THE DOTPLOT ###
    my $nWindow   = 5;
    my $maxMisses = 1;

    open OUT, ">ID_matrix.txt" or die $!;

    my $nWindow1 = ( ( length $seq1 ) - $nWindow );
    my $nWindow2 = ( ( length $seq2 ) - $nWindow );

    my $time_start = time;

    for my $off1 ( 0 .. $nWindow1 ) {
        my $sub1 = substr $seq1, $off1, $nWindow;
        for my $off2 ( 0 .. $nWindow2 ) {
            my $sub2 = substr $seq2, $off2, $nWindow;
            my $misses = $nWindow - ( ( $sub1 ^ $sub2 ) =~ tr[\0][\0] );
            print OUT $misses > $maxMisses ? 0 : 1;
        }
        print OUT "\n";
    }

    my $time_end = time;
    my $time_used = $time_end - $time_start;
    close OUT;
    print"Time used: $time_used seconds.\n";

    __END__
    c:\test>junk9        ## original
    Time used: 4.12800002098084 seconds.
    Press any key to continue . . .

    c:\test>914283       ## This code
    Time used: 1.87000012397766 seconds.

    c:\test>dir ID*
    14/07/2011  12:19           994,008 ID_matrix.ref
    14/07/2011  15:24           994,008 ID_matrix.txt

    c:\test>diff ID_matrix.ref ID_matrix.txt    ## sanity check
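The speedup comes from the string XOR line: XORing two equal-length strings yields a NUL byte wherever the characters agree, and `tr` counts those NULs without a per-character Perl loop. Here is a minimal sketch of the idiom in isolation, assuming both strings have the same length and that the `bitwise` feature is not enabled (so `^` acts on strings). The helper name `count_mismatches` and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the XOR idiom from the code above.
sub count_mismatches {
    my ($x, $y) = @_;                   # two strings of equal length
    my $xor     = $x ^ $y;              # "\0" at every position where they agree
    my $matches = ( $xor =~ tr/\0// );  # tr with an empty replacement only counts
    return length($x) - $matches;
}

print count_mismatches('ATGCA', 'ATGCA'), "\n";   # 0
print count_mismatches('ATGCA', 'ATGGA'), "\n";   # 1
print count_mismatches('ATGCA', 'TACGT'), "\n";   # 5
```

A window pair counts as a hit when this number does not exceed the allowed mismatches, which is the comparison against `$maxMisses` in the loop above.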
