prbndr has asked for the wisdom of the Perl Monks concerning the following question:

hi monks,

i'm still relatively new at perl but getting better and better with practice. my question for now is how to compare two strings (of the same length of let's say about 40 characters) and obtain the counts of potentially mismatching characters. take, for example, the following two strings:

target = ATTCCGGG str1 = ATTGCGGG str2 = ATACCGGC

i would like to compare str1 to "target" and str2 to "target" and count the mismatch type at the mismatching position. comparing str1 to target gives one mismatch at position 3 which is a C->G. comparing str2 to target gives two mismatches at positions 2 and 7 which are T->A and G->C respectively. is there an efficient way to do this for millions of different targets and strings?

i have the following code using PDL:
use PDL; use PDL::Char; + $PDL::SHARE=$PDL::SHARE; # keep stray warning quiet my $source=PDL::Char->new("ATTCCGGG"); + for my $str ( "ATTGCGGG") { my $match =PDL::Char->new($str); + my @diff=which($match!=$source)->list; + print "@diff\n"; + }
this code doesn't give me the specific types of mismatches though. an A,T,G, or C in the target can transform into an A,T,G, or C in the strings, so i would like to keep track of these conversions. any advice?

Original content restored above by GrandFather

oops...i deleted my post...

i'm trying to find the positions and types of differences between two strings. take the "target" and the strings:

$target = "ATTCCGGG"; $str1 = "ATTGCGGG"; # 1 mismatch with target at position 3 (C->G) $str2 = "ATACCGGC"; # 2 mismatches with target at position 2 and 7 (T->A and G->C)

how do i go about obtaining the differences between millions of targets and strings in an efficient manner? i have the following code using PDL:
use PDL; use PDL::Char; + $PDL::SHARE=$PDL::SHARE; # keep stray warning quiet my $source=PDL::Char->new("ATTCCGGG"); + for my $str ( "ATTGCGGG", "ATACCGGC") { + my $match =PDL::Char->new($str); + my @diff=which($match!=$source)->list; + print "@diff\n"; + }
but this only gives me positions. how do i look for the actual conversions that occur too? an A,T,C,G can convert to an A,T,C, or G respectively. any advice?

Replies are listed 'Best First'.
Re: mismatching characters in dna sequence
by Eliya (Vicar) on Dec 30, 2011 at 03:26 UTC

    Clever mapping of the characters in one of the strings could work around the problems with the XOR approach.  For example, using tr/ATCG/HRDZ/:

    my %change; # reverse lookup table my @t = qw(A T C G); for my $t1 (@t) { for my $t2 (@t) { my $t = $t2; $t =~ tr/ATCG/HRDZ/; $change{ $t1 ^ $t } = "$t1->$t2"; } } sub diff { my ($target, $str) = @_; $str =~ tr/ATCG/HRDZ/; my $diff = $target ^ $str; while ($diff =~ /([^\x09\x06\x07\x1d])/g) { printf " %d: %s\n", pos($diff), $change{$1}; } } my $target = "ATTCCGGG"; for (qw(ATTGCGGG ATACCGGC)) { print "\n$target\n$_\n"; diff($target, $_); } __END__ ATTCCGGG ATTGCGGG 4: C->G ATTCCGGG ATACCGGC 3: T->A 8: G->C

    The reverse lookup table just needs to be set up once, and the remaining operations (string bit operation, tr///, m//g) should all be pretty fast.

    (As the keys in the lookup table are integers < 256, you could in theory also set up an array, and use the xor value as the index.)

Re: mismatching characters in dna sequence
by choroba (Cardinal) on Dec 30, 2011 at 00:53 UTC
    Have you tried binary xoring the strings? Zeros will occur in positions that are occupied by the same letter, the other results are unique for all combinations.
    perl -e '$s1 = "AAAACCCCTTTTGGGG"; $s2 = "ACTGACTGACTGACTG"; print $s1 ^ $s2; ' | hexdump 00000000 00 02 15 06 02 00 17 04 15 17 00 13 06 04 13 00
      i don't know what this means...
        If you apply binary xor (exclusive-or) on the strings, you get another string as a result. This string contains the null character in places where the two strings were identical, and various other characters in positions where the two strings were different. For each tuple of letters, the result is different.
Re: mismatching characters in dna sequence
by davido (Cardinal) on Dec 30, 2011 at 01:52 UTC

    This isn't built around PDL, but I'd be surprised if there's not a simple way to adapt it to work with your PDL objects.

    use strict; use warnings; my $target = "ATTCCGGG"; my @tests = qw/ ATTGCGGG ATACCGGC /; foreach my $test ( @tests ) { print "Testing: Target: $target\tTest: $test\n"; my @results = map { my ( $target_c, $test_c ) = ( substr( $target, $_, 1 ), substr( $test, $_, 1 ) ); ( $target_c ne $test_c ) ? [ $_, $target_c, $test_c ] : (); } 0 .. length( $target ) - 1; print "\t$_->[0]: $_->[1] -> $_->[2]\n" for @results; }

    Output:

    Testing: Target: ATTCCGGG Test: ATTGCGGG 3: C -> G Testing: Target: ATTCCGGG Test: ATACCGGC 2: T -> A 7: G -> C

    Dave

      ^^how fast would this be for millions of sequences? also, here's my PDL code:

      #!/opt/local/bin/perl use Bio::Seq; use Bio::DB::Sam; use PDL; use PDL::Char; $PDL::SHARE = $PDL::SHARE; # Load bam file my $bamfile = Bio::DB::Sam->new( -bam => "eg1.bam" ); my @allReads = $bamfile->features( -type => 'match' ); for my $read (@allReads) { # Get read and target sequences ( $ref, $matches, $query ) = $read->padded_alignment; my $target = PDL::Char->new("$ref"); for my $str ("$query") { my $match = PDL::Char->new($str); my @diff = which( $match != $target )->list; # Print read name and mismatching positions print $read->qname,': ', "@diff\n"; } }

      these are actually quite slow for even 10,000 comparisons. i think it has to do with the use of substr. any suggestions on how to scale up the performance?

Re: mismatching characters in dna sequence
by BrowserUk (Patriarch) on Dec 30, 2011 at 06:39 UTC

    You say lower down that "these are actually quite slow for even 10,000 comparisons".

    1. How long?
    2. How many comparisons?

      Is the time you've (hopefully) supplied, for comparing each of the 10,000 sequences against one target string (ie. 10,000 compares)?

      Or each of the 10,000 against each of the other 9,999 (ie. 50 million compares)?

    3. How long are the sequences you are dealing with?

    I have a routine that will cross-compare 10,000 x 8-char sequences (50e6 comparisons) in 246 seconds. And 10,000 x 16-chars sequences in a little under 500 seconds. How does that compare to your current method?

    To put that in perspective, if your "actually quite slow" is for 10,000 compares against a single target, then I can do that in 0.05 seconds.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      i'm doing 10,000 compares of 40 character sequences against one target (10,000 compares) in about 5 seconds. your routine seems much faster!

        Which method are you talking about, with respect to those 5 seconds?  The XOR method outlined above takes just ~0.05 secs on my 4-year-oldish machine, for 10,000 comparisons against a common 40-char target (with ~3-5 deviations per sequence):

        my @set = qw(A T C G); my $target = join "", map $set[rand @set], 1..40; my @tests ; for (1..10000) { my $test = $target; for (1..5) { substr($test, rand(length($target)), 1) = $set[rand @set]; } push @tests, $test; } use Time::HiRes qw(time); my $start = time(); my %change; # reverse lookup table for my $t1 (@set) { my $t = $t1; $t =~ tr/ATCG/HRDZ/; for my $t2 (@set) { $change{ $t ^ $t2 } = "$t1->$t2"; } } $target =~ tr/ATCG/HRDZ/; for my $test (@tests) { my $diff = $target ^ $test; while ($diff =~ /([^\x09\x06\x07\x1d])/g) { my $pos = pos($diff); my $change = $change{$1}; # do something with them... } } printf "%.3f secs\n", time() - $start; __END__ 0.052 secs

        Storing away the results somewhere or doing something else with them will presumably take considerably longer than computing them...

        Character by character processing of strings is the single biggest weakness in perl's arsenal. As you've already used PDl, you probably have a compiler, so dropping into Inline::C should not be a problem for you.

        The two routines below differ in the way the return the results. dnacmp() returns a list of strings like this:"123:A:C", whereas dnacmp2() concatenates all those into a single string for return:

        #! perl -slw use strict; use Inline C => Config => BUILD_NOISY => 1; use Inline C => <<'END_C', NAME => 'dnacmp', CLEAN_AFTER_BUILD => 0; void dnacmp( char *a, char *b ) { int n = 0; Inline_Stack_Vars; Inline_Stack_Reset; while( *a && *b ) { if( *a != *b ) { Inline_Stack_Push( sv_2mortal( newSVpvf( "%d:%c:%c", n, *a +, *b ) ) ); } a++; b++; n++; } Inline_Stack_Done; return; } SV *dnacmp2( char *a, char *b ) { SV *res = newSVpv( "", 0 ); int n = 0; while( *a && *b ) { if( *a != *b ) sv_catpvf( res, "%d:%c:%c ", n, *a, *b ); ++a, ++b, ++n; } return res; } END_C use Data::Dump qw[ pp ]; use Time::HiRes qw[ time ]; sub rndStr{ join'', @_[ map{ rand @_ } 1 .. shift ] } print for dnacmp( 'ATTCCGGG', 'ATACCGC' ); print dnacmp2( 'ATTCGGG', 'ATACCGGC' ); our $N //= 10e3; our $L //= 8; my @seqs; $#seqs = $N-1; $seqs[ $_ ] = rndStr( $L, qw[ A C G T ] ) for 0 .. $#seqs; my $start = time; for my $i ( 0 .. $#seqs ) { for my $j ( $i+1 .. $#seqs ) { print "@seqs[ $i, $j ] : ", dnacmp2( @seqs[ $i, $j ] ); } } printf STDERR "Took %f seconds to cross compare $N sequences (%d compa +risons)\n", time() - $start, ($N**2-$N)/2; __END__ C:\test>dnacmp -N=142 -L=40 2:T:A 6:G:C 2:T:A 4:G:C Took 0.233414 seconds to cross compare 142 sequences (10011 comparison +s) C:\test>head out.dat 2:T:A 6:G:C 2:T:A 4:G:C CTTTACGGTCGCCAGTGCGCTAGCGGTGCTGGGTATGTCC TTTCGGTCAGGAATGGCGGCCAGCATATA +ATCGGAGCCGT : 0:C:T 3:T:C 4:A:G 5:C:G 6:G:T 7:G:C 8:T:A 9:C:G 11:C:A +12:C:A 13:A:T 15:T:G 16:G:C 17:C:G 20:T:C 24:G:A 25:G:T 26:T:A 27:G:T + 28:C:A 29:T:A 30:G:T 31:G:C 33:T:G 35:T:G 36:G:C 37:T:C 38:C:G 39:C: +T CTTTACGGTCGCCAGTGCGCTAGCGGTGCTGGGTATGTCC CCGGGAACTCATTGACTCATCTCTGCACG +CGGAAAGAGAC : 1:T:C 2:T:G 3:T:G 4:A:G 5:C:A 6:G:A 7:G:C 10:G:A 11:C:T + 12:C:T 13:A:G 14:G:A 15:T:C 16:G:T 18:G:A 19:C:T 20:T:C 21:A:T 22:G: +C 23:C:T 25:G:C 26:T:A 27:G:C 28:C:G 29:T:C 32:G:A 33:T:A 35:T:G 36:G +:A 37:T:G 38:C:A CTTTACGGTCGCCAGTGCGCTAGCGGTGCTGGGTATGTCC ACACCGTTACAGGAGCTTACGATAACGAC +CTGACCAATAC : 0:C:A 1:T:C 2:T:A 3:T:C 4:A:C 5:C:G 6:G:T 7:G:T 8:T:A 1 +0:G:A 11:C:G 12:C:G 15:T:C 16:G:T 17:C:T 18:G:A 20:T:G 22:G:T 23:C:A +24:G:A 25:G:C 26:T:G 27:G:A 29:T:C 30:G:T 32:G:A 33:T:C 34:A:C 35:T:A + 36:G:A 38:C:A CTTTACGGTCGCCAGTGCGCTAGCGGTGCTGGGTATGTCC TTTATGCTGTAAACTAGACTAACTCCCTT +AGCTACTGACG : 0:C:T 3:T:A 4:A:T 5:C:G 6:G:C 7:G:T 8:T:G 9:C:T 10:G:A +11:C:A 12:C:A 13:A:C 14:G:T 15:T:A 17:C:A 18:G:C 19:C:T 20:T:A 22:G:C + 23:C:T 24:G:C 25:G:C 26:T:C 27:G:T 28:C:T 29:T:A 31:G:C 32:G:T 33:T: +A 34:A:C 37:T:A 39:C:G CTTTACGGTCGCCAGTGCGCTAGCGGTGCTGGGTATGTCC CTCGTTCGGTATATGGCTGACGGAATTCG +CTACGGGCACG : 2:T:C 3:T:G 4:A:T 5:C:T 6:G:C 8:T:G 9:C:T 10:G:A 11:C:T + 12:C:A 13:A:T 15:T:G 16:G:C 17:C:T 19:C:A 20:T:C 21:A:G 23:C:A 24:G: +A 25:G:T 27:G:C 28:C:G 29:T:C 30:G:T 31:G:A 32:G:C 33:T:G 34:A:G 35:T +:G 36:G:C 37:T:A 39:C:G CTTTACGGTCGCCAGTGCGCTAGCGGTGCTGGGTATGTCC AAAAGGTGCGGCACTAGCGTAAACTGCAC +GAGCGTACACC : 0:C:A 1:T:A 2:T:A 3:T:A 4:A:G 5:C:G 6:G:T 8:T:C 9:C:G 1 +2:C:A 13:A:C 14:G:T 15:T:A 19:C:T 20:T:A 22:G:A 24:G:T 26:T:C 27:G:A +29:T:G 30:G:A 32:G:C 33:T:G 34:A:T 35:T:A 36:G:C 37:T:A CTTTACGGTCGCCAGTGCGCTAGCGGTGCTGGGTATGTCC TCAAGGGCGGCGGTAAAACAATGATGTAC +ACCTCGTCGCG : 0:C:T 1:T:C 2:T:A 3:T:A 4:A:G 5:C:G 7:G:C 8:T:G 9:C:G 1 +0:G:C 11:C:G 12:C:G 13:A:T 14:G:A 15:T:A 16:G:A 17:C:A 18:G:C 19:C:A +20:T:A 21:A:T 23:C:A 24:G:T 27:G:A 29:T:A 30:G:C 31:G:C 32:G:T 33:T:C + 34:A:G 36:G:C 37:T:G 39:C:G C:\test>tail out.dat TACATGCCCTATCCAGGCAACGCATTAGGCACTGCGTGTA ACAATGCGACGGCTCCAGACTTGATACCT +GTATCAAACTA : 0:T:A 1:A:C 2:C:A 7:C:G 8:C:A 9:T:C 10:A:G 11:T:G 13:C: +T 14:A:C 15:G:C 16:G:A 17:C:G 19:A:C 20:C:T 21:G:T 22:C:G 25:T:A 26:A +:C 27:G:C 28:G:T 29:C:G 30:A:T 31:C:A 33:G:C 34:C:A 35:G:A 36:T:A 37: +G:C TACATGCCCTATCCAGGCAACGCATTAGGCACTGCGTGTA GGTGGAAACGGCGTGTGTAGAAATTTATT +CACGGTACTGT : 0:T:G 1:A:G 2:C:T 3:A:G 4:T:G 5:G:A 6:C:A 7:C:A 9:T:G 1 +0:A:G 11:T:C 12:C:G 13:C:T 14:A:G 15:G:T 17:C:T 19:A:G 20:C:A 21:G:A +22:C:A 23:A:T 27:G:T 28:G:T 32:T:G 34:C:T 35:G:A 36:T:C 37:G:T 38:T:G + 39:A:T TACATGCCCTATCCAGGCAACGCATTAGGCACTGCGTGTA GATAGTTCCAAGCCACCTCCCACCCTGCT +CAACATCGCTA : 0:T:G 2:C:T 4:T:G 5:G:T 6:C:T 9:T:A 11:T:G 15:G:C 16:G: +C 17:C:T 18:A:C 19:A:C 21:G:A 23:A:C 24:T:C 26:A:G 27:G:C 28:G:T 31:C +:A 32:T:C 33:G:A 34:C:T 35:G:C 36:T:G 37:G:C TACATGCCCTATCCAGGCAACGCATTAGGCACTGCGTGTA AAGGTTGGTACTTACTAACCGGTTTGAGC +CGTTCAACGAA : 0:T:A 2:C:G 3:A:G 5:G:T 6:C:G 7:C:G 8:C:T 9:T:A 10:A:C +12:C:T 13:C:A 14:A:C 15:G:T 16:G:A 17:C:A 18:A:C 19:A:C 20:C:G 22:C:T + 23:A:T 25:T:G 28:G:C 30:A:G 31:C:T 33:G:C 34:C:A 35:G:A 36:T:C 38:T: +A ACAATGCGACGGCTCCAGACTTGATACCTGTATCAAACTA GGTGGAAACGGCGTGTGTAGAAATTTATT +CACGGTACTGT : 0:A:G 1:C:G 2:A:T 3:A:G 4:T:G 5:G:A 6:C:A 7:G:A 8:A:C 9 +:C:G 11:G:C 12:C:G 14:C:G 15:C:T 16:A:G 17:G:T 19:C:G 20:T:A 21:T:A 2 +2:G:A 23:A:T 25:A:T 26:C:A 27:C:T 29:G:C 30:T:A 31:A:C 32:T:G 33:C:G +34:A:T 36:A:C 37:C:T 38:T:G 39:A:T ACAATGCGACGGCTCCAGACTTGATACCTGTATCAAACTA GATAGTTCCAAGCCACCTCCCACCCTGCT +CAACATCGCTA : 0:A:G 1:C:A 2:A:T 4:T:G 5:G:T 6:C:T 7:G:C 8:A:C 9:C:A 1 +0:G:A 13:T:C 14:C:A 16:A:C 17:G:T 18:A:C 20:T:C 21:T:A 22:G:C 23:A:C +24:T:C 25:A:T 26:C:G 29:G:C 30:T:A 32:T:C 33:C:A 34:A:T 35:A:C 36:A:G ACAATGCGACGGCTCCAGACTTGATACCTGTATCAAACTA AAGGTTGGTACTTACTAACCGGTTTGAGC +CGTTCAACGAA : 1:C:A 2:A:G 3:A:G 5:G:T 6:C:G 8:A:T 9:C:A 10:G:C 11:G:T + 12:C:T 13:T:A 15:C:T 17:G:A 18:A:C 20:T:G 21:T:G 22:G:T 23:A:T 25:A: +G 26:C:A 27:C:G 28:T:C 29:G:C 30:T:G 31:A:T 36:A:C 37:C:G 38:T:A GGTGGAAACGGCGTGTGTAGAAATTTATTCACGGTACTGT GATAGTTCCAAGCCACCTCCCACCCTGCT +CAACATCGCTA : 1:G:A 3:G:A 5:A:T 6:A:T 7:A:C 9:G:A 10:G:A 11:C:G 12:G: +C 13:T:C 14:G:A 15:T:C 16:G:C 18:A:C 19:G:C 20:A:C 22:A:C 23:T:C 24:T +:C 26:A:G 27:T:C 31:C:A 32:G:C 33:G:A 35:A:C 36:C:G 37:T:C 38:G:T 39: +T:A GGTGGAAACGGCGTGTGTAGAAATTTATTCACGGTACTGT AAGGTTGGTACTTACTAACCGGTTTGAGC +CGTTCAACGAA : 0:G:A 1:G:A 2:T:G 4:G:T 5:A:T 6:A:G 7:A:G 8:C:T 9:G:A 1 +0:G:C 11:C:T 12:G:T 13:T:A 14:G:C 16:G:A 17:T:A 18:A:C 19:G:C 20:A:G +21:A:G 22:A:T 25:T:G 27:T:G 28:T:C 30:A:G 31:C:T 32:G:T 33:G:C 34:T:A + 37:T:G 38:G:A 39:T:A GATAGTTCCAAGCCACCTCCCACCCTGCTCAACATCGCTA AAGGTTGGTACTTACTAACCGGTTTGAGC +CGTTCAACGAA : 0:G:A 2:T:G 3:A:G 4:G:T 6:T:G 7:C:G 8:C:T 10:A:C 11:G:T + 12:C:T 13:C:A 14:A:C 15:C:T 16:C:A 17:T:A 20:C:G 21:A:G 22:C:T 23:C: +T 24:C:T 25:T:G 26:G:A 27:C:G 28:T:C 30:A:G 31:A:T 32:C:T 33:A:C 34:T +:A 35:C:A 36:G:C 37:C:G 38:T:A

        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        The start of some sanity?

Re: mismatching characters in dna sequence
by TJPride (Pilgrim) on Dec 30, 2011 at 04:11 UTC
    Runs fine for me with 5 million characters on each string - only takes like 3 seconds and this is a really old comp.

    use strict; use warnings; my ($s1, $s2, $l, $i, $c1, $c2, @m); $s1 = 'ATACCGGC'; $s1 .= 'ATTTT'x1000000; $s2 = 'ATTCCGGG'; $s2 .= 'ATTTT'x1000000; for $i (0..(length($s1)-1)) { $c1 = substr($s1, $i, 1); $c2 = substr($s2, $i, 1); push @m, [$i, $c1, $c2] if $c1 ne $c2; } print (($#m+1) . ' mismatches with target at position(s) ' . join(', ' +, map { $_->[0] } @m) . ' (' . join(', ', map { $_->[2].'->'.$_->[1] +} @m) . ')');

      When comparing many strings against many strings (not just two), performance actually might matter.

      On my machine, your substr method takes 2.65 secs, while the XOR method I suggested above takes only 0.12 secs (for the same data), i.e. it's roughly 20 times as fast.

        eliya -- is your transliteration correct? the results differ from the other code snippets already posted.
Re: mismatching characters in dna sequence
by tobyink (Canon) on Dec 31, 2011 at 20:41 UTC

    Personally I'd drop the A T C G approach altogether. As I understand it, with DNA a "word" is always three characters long. So "AACTCGATG" is really three words "AAC-TCG-ATG". There are 64 (i.e. 4^3) possible words, so each three letter word can be represented by a single byte in a binary string (with two bits left over).

    Establish a mapping along the lines of AAA=chr(0), AAC=chr(1), AAG=chr(2), ..., GGG=chr(63). Then convert your "AACTCGATG" strings to binary strings when they are first input - keep them as binary strings everywhere internal within your program and just reformat them back to letters when you need to display them.

    Your strings will be a third of the size of the original input strings, making string comparisons that much faster. (And yes, I'd use XOR to handle the comparisons - it's likely to be the fastest method.) Of course, if you've got millions of strings that need comparing, the initial import, converting from DNA letters into binary strings will add some overhead to your program, but because the number of possible comparisons scales exponentially with regard to the number of strings, the overhead is likely to be worth it.

    I am assuming here that you're working on real DNA data, and not fake data made up for a programming assignment. If it's real data, then strings will always be a multiple of three characters, which is kinda important for the techniques I outline above to work. If it is fake data, then it's still doable - you just pad the end to a multiple of three characters, and then use those two spare bits on the last byte of the binary string to indicate how many padding letters there are. This does somewhat complicate implementation though.

      As I understand it, with DNA a "word" is always three characters long.... I am assuming here that you're working on real DNA data, and not fake data made up for a programming assignment.

      The trouble with that idea is you are assuming that all DNA data is correctly transcribed. But often the very purpose of these programs is to detect errors in the transcriptions.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      The start of some sanity?