Re: mismatching characters in dna sequence
by Eliya (Vicar) on Dec 30, 2011 at 03:26 UTC
|
Clever mapping of the characters in one of the strings could work around the problems with the XOR approach. For example, using tr/ATCG/HRDZ/:
my %change; # reverse lookup table
my @t = qw(A T C G);
for my $t1 (@t) {
for my $t2 (@t) {
my $t = $t2;
$t =~ tr/ATCG/HRDZ/;
$change{ $t1 ^ $t } = "$t1->$t2";
}
}
sub diff {
my ($target, $str) = @_;
$str =~ tr/ATCG/HRDZ/;
my $diff = $target ^ $str;
while ($diff =~ /([^\x09\x06\x07\x1d])/g) {
printf " %d: %s\n", pos($diff), $change{$1};
}
}
my $target = "ATTCCGGG";
for (qw(ATTGCGGG ATACCGGC)) {
print "\n$target\n$_\n";
diff($target, $_);
}
__END__
ATTCCGGG
ATTGCGGG
4: C->G
ATTCCGGG
ATACCGGC
3: T->A
8: G->C
The reverse lookup table just needs to be set up once, and the remaining operations (string bit operation, tr///, m//g) should all be pretty fast.
(As the keys in the lookup table are integers < 256, you could in theory also set up an array, and use the xor value as the index.)
| [reply] [d/l] [select] |
Re: mismatching characters in dna sequence
by choroba (Cardinal) on Dec 30, 2011 at 00:53 UTC
|
Have you tried binary xoring the strings? Zeros will occur in positions that are occupied by the same letter, the other results are unique for all combinations.
perl -e '$s1 = "AAAACCCCTTTTGGGG";
$s2 = "ACTGACTGACTGACTG";
print $s1 ^ $s2; ' | hexdump
00000000 00 02 15 06 02 00 17 04 15 17 00 13 06 04 13 00
| [reply] [d/l] |
|
|
i don't know what this means...
| [reply] |
|
|
If you apply binary xor (exclusive-or) on the strings, you get another string as a result. This string contains the null character in places where the two strings were identical, and various other characters in positions where the two strings were different. For each tuple of letters, the result is different.
| [reply] |
|
|
|
|
|
|
Re: mismatching characters in dna sequence
by davido (Cardinal) on Dec 30, 2011 at 01:52 UTC
|
This isn't built around PDL, but I'd be surprised if there's not a simple way to adapt it to work with your PDL objects.
use strict;
use warnings;
my $target = "ATTCCGGG";
my @tests = qw/ ATTGCGGG ATACCGGC /;
foreach my $test ( @tests ) {
print "Testing: Target: $target\tTest: $test\n";
my @results = map {
my ( $target_c, $test_c ) = (
substr( $target, $_, 1 ),
substr( $test, $_, 1 )
);
( $target_c ne $test_c ) ? [ $_, $target_c, $test_c ] : ();
} 0 .. length( $target ) - 1;
print "\t$_->[0]: $_->[1] -> $_->[2]\n" for @results;
}
Output:
Testing: Target: ATTCCGGG Test: ATTGCGGG
3: C -> G
Testing: Target: ATTCCGGG Test: ATACCGGC
2: T -> A
7: G -> C
| [reply] [d/l] [select] |
|
|
#!/opt/local/bin/perl
use Bio::Seq;
use Bio::DB::Sam;
use PDL;
use PDL::Char;
$PDL::SHARE = $PDL::SHARE;
# Load bam file
my $bamfile = Bio::DB::Sam->new( -bam => "eg1.bam" );
my @allReads = $bamfile->features( -type => 'match' );
for my $read (@allReads) {
# Get read and target sequences
( $ref, $matches, $query ) = $read->padded_alignment;
my $target = PDL::Char->new("$ref");
for my $str ("$query") {
my $match = PDL::Char->new($str);
my @diff = which( $match != $target )->list;
# Print read name and mismatching positions
print $read->qname,': ', "@diff\n";
}
}
| [reply] [d/l] |
|
|
| [reply] |
Re: mismatching characters in dna sequence
by BrowserUk (Patriarch) on Dec 30, 2011 at 06:39 UTC
|
You say lower down that "these are actually quite slow for even 10,000 comparisons".
- How long?
- How many comparisons?
Is the time you've (hopefully) supplied, for comparing each of the 10,000 sequences against one target string (ie. 10,000 compares)?
Or each of the 10,000 against each of the other 9,999 (ie. 50 million compares)?
- How long are the sequences you are dealing with?
I have a routine that will cross-compare 10,000 x 8-char sequences (50e6 comparisons) in 246 seconds. And 10,000 x 16-chars sequences in a little under 500 seconds. How does that compare to your current method?
To put that in perspective, if your "actually quite slow" is for 10,000 compares against a single target, then I can do that in 0.05 seconds.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
| [reply] |
|
|
i'm doing 10,000 compares of 40 character sequences against one target (10,000 compares) in about 5 seconds. your routine seems much faster!
| [reply] |
|
|
Which method are you talking about, with respect to those 5 seconds? The XOR method outlined above takes just ~0.05 secs on my 4-year-oldish machine, for 10,000 comparisons against a common 40-char target (with ~3-5 deviations per sequence):
my @set = qw(A T C G);
my $target = join "", map $set[rand @set], 1..40;
my @tests ;
for (1..10000) {
my $test = $target;
for (1..5) {
substr($test, rand(length($target)), 1) = $set[rand @set];
}
push @tests, $test;
}
use Time::HiRes qw(time);
my $start = time();
my %change; # reverse lookup table
for my $t1 (@set) {
my $t = $t1;
$t =~ tr/ATCG/HRDZ/;
for my $t2 (@set) {
$change{ $t ^ $t2 } = "$t1->$t2";
}
}
$target =~ tr/ATCG/HRDZ/;
for my $test (@tests) {
my $diff = $target ^ $test;
while ($diff =~ /([^\x09\x06\x07\x1d])/g) {
my $pos = pos($diff);
my $change = $change{$1};
# do something with them...
}
}
printf "%.3f secs\n", time() - $start;
__END__
0.052 secs
Storing away the results somewhere or doing something else with them will presumably take considerably longer than computing them...
| [reply] [d/l] |
|
|
|
|
Character by character processing of strings is the single biggest weakness in perl's arsenal. As you've already used PDl, you probably have a compiler, so dropping into Inline::C should not be a problem for you.
The two routines below differ in the way the return the results. dnacmp() returns a list of strings like this:"123:A:C", whereas dnacmp2() concatenates all those into a single string for return:
#! perl -slw
use strict;
use Inline C => Config => BUILD_NOISY => 1;
use Inline C => <<'END_C', NAME => 'dnacmp', CLEAN_AFTER_BUILD => 0;
void dnacmp( char *a, char *b ) {
int n = 0;
Inline_Stack_Vars;
Inline_Stack_Reset;
while( *a && *b ) {
if( *a != *b ) {
Inline_Stack_Push( sv_2mortal( newSVpvf( "%d:%c:%c", n, *a
+, *b ) ) );
}
a++; b++; n++;
}
Inline_Stack_Done;
return;
}
SV *dnacmp2( char *a, char *b ) {
SV *res = newSVpv( "", 0 );
int n = 0;
while( *a && *b ) {
if( *a != *b )
sv_catpvf( res, "%d:%c:%c ", n, *a, *b );
++a, ++b, ++n;
}
return res;
}
END_C
use Data::Dump qw[ pp ];
use Time::HiRes qw[ time ];
sub rndStr{ join'', @_[ map{ rand @_ } 1 .. shift ] }
print for dnacmp( 'ATTCCGGG', 'ATACCGC' );
print dnacmp2( 'ATTCGGG', 'ATACCGGC' );
our $N //= 10e3;
our $L //= 8;
my @seqs; $#seqs = $N-1;
$seqs[ $_ ] = rndStr( $L, qw[ A C G T ] ) for 0 .. $#seqs;
my $start = time;
for my $i ( 0 .. $#seqs ) {
for my $j ( $i+1 .. $#seqs ) {
print "@seqs[ $i, $j ] : ", dnacmp2( @seqs[ $i, $j ] );
}
}
printf STDERR "Took %f seconds to cross compare $N sequences (%d compa
+risons)\n",
time() - $start, ($N**2-$N)/2;
__END__
C:\test>dnacmp -N=142 -L=40
2:T:A
6:G:C
2:T:A 4:G:C
Took 0.233414 seconds to cross compare 142 sequences (10011 comparison
+s)
C:\test>head out.dat
2:T:A
6:G:C
2:T:A 4:G:C
CTTTACGGTCGCCAGTGCGCTAGCGGTGCTGGGTATGTCC TTTCGGTCAGGAATGGCGGCCAGCATATA
+ATCGGAGCCGT : 0:C:T 3:T:C 4:A:G 5:C:G 6:G:T 7:G:C 8:T:A 9:C:G 11:C:A
+12:C:A 13:A:T 15:T:G 16:G:C 17:C:G 20:T:C 24:G:A 25:G:T 26:T:A 27:G:T
+ 28:C:A 29:T:A 30:G:T 31:G:C 33:T:G 35:T:G 36:G:C 37:T:C 38:C:G 39:C:
+T
CTTTACGGTCGCCAGTGCGCTAGCGGTGCTGGGTATGTCC CCGGGAACTCATTGACTCATCTCTGCACG
+CGGAAAGAGAC : 1:T:C 2:T:G 3:T:G 4:A:G 5:C:A 6:G:A 7:G:C 10:G:A 11:C:T
+ 12:C:T 13:A:G 14:G:A 15:T:C 16:G:T 18:G:A 19:C:T 20:T:C 21:A:T 22:G:
+C 23:C:T 25:G:C 26:T:A 27:G:C 28:C:G 29:T:C 32:G:A 33:T:A 35:T:G 36:G
+:A 37:T:G 38:C:A
CTTTACGGTCGCCAGTGCGCTAGCGGTGCTGGGTATGTCC ACACCGTTACAGGAGCTTACGATAACGAC
+CTGACCAATAC : 0:C:A 1:T:C 2:T:A 3:T:C 4:A:C 5:C:G 6:G:T 7:G:T 8:T:A 1
+0:G:A 11:C:G 12:C:G 15:T:C 16:G:T 17:C:T 18:G:A 20:T:G 22:G:T 23:C:A
+24:G:A 25:G:C 26:T:G 27:G:A 29:T:C 30:G:T 32:G:A 33:T:C 34:A:C 35:T:A
+ 36:G:A 38:C:A
CTTTACGGTCGCCAGTGCGCTAGCGGTGCTGGGTATGTCC TTTATGCTGTAAACTAGACTAACTCCCTT
+AGCTACTGACG : 0:C:T 3:T:A 4:A:T 5:C:G 6:G:C 7:G:T 8:T:G 9:C:T 10:G:A
+11:C:A 12:C:A 13:A:C 14:G:T 15:T:A 17:C:A 18:G:C 19:C:T 20:T:A 22:G:C
+ 23:C:T 24:G:C 25:G:C 26:T:C 27:G:T 28:C:T 29:T:A 31:G:C 32:G:T 33:T:
+A 34:A:C 37:T:A 39:C:G
CTTTACGGTCGCCAGTGCGCTAGCGGTGCTGGGTATGTCC CTCGTTCGGTATATGGCTGACGGAATTCG
+CTACGGGCACG : 2:T:C 3:T:G 4:A:T 5:C:T 6:G:C 8:T:G 9:C:T 10:G:A 11:C:T
+ 12:C:A 13:A:T 15:T:G 16:G:C 17:C:T 19:C:A 20:T:C 21:A:G 23:C:A 24:G:
+A 25:G:T 27:G:C 28:C:G 29:T:C 30:G:T 31:G:A 32:G:C 33:T:G 34:A:G 35:T
+:G 36:G:C 37:T:A 39:C:G
CTTTACGGTCGCCAGTGCGCTAGCGGTGCTGGGTATGTCC AAAAGGTGCGGCACTAGCGTAAACTGCAC
+GAGCGTACACC : 0:C:A 1:T:A 2:T:A 3:T:A 4:A:G 5:C:G 6:G:T 8:T:C 9:C:G 1
+2:C:A 13:A:C 14:G:T 15:T:A 19:C:T 20:T:A 22:G:A 24:G:T 26:T:C 27:G:A
+29:T:G 30:G:A 32:G:C 33:T:G 34:A:T 35:T:A 36:G:C 37:T:A
CTTTACGGTCGCCAGTGCGCTAGCGGTGCTGGGTATGTCC TCAAGGGCGGCGGTAAAACAATGATGTAC
+ACCTCGTCGCG : 0:C:T 1:T:C 2:T:A 3:T:A 4:A:G 5:C:G 7:G:C 8:T:G 9:C:G 1
+0:G:C 11:C:G 12:C:G 13:A:T 14:G:A 15:T:A 16:G:A 17:C:A 18:G:C 19:C:A
+20:T:A 21:A:T 23:C:A 24:G:T 27:G:A 29:T:A 30:G:C 31:G:C 32:G:T 33:T:C
+ 34:A:G 36:G:C 37:T:G 39:C:G
C:\test>tail out.dat
TACATGCCCTATCCAGGCAACGCATTAGGCACTGCGTGTA ACAATGCGACGGCTCCAGACTTGATACCT
+GTATCAAACTA : 0:T:A 1:A:C 2:C:A 7:C:G 8:C:A 9:T:C 10:A:G 11:T:G 13:C:
+T 14:A:C 15:G:C 16:G:A 17:C:G 19:A:C 20:C:T 21:G:T 22:C:G 25:T:A 26:A
+:C 27:G:C 28:G:T 29:C:G 30:A:T 31:C:A 33:G:C 34:C:A 35:G:A 36:T:A 37:
+G:C
TACATGCCCTATCCAGGCAACGCATTAGGCACTGCGTGTA GGTGGAAACGGCGTGTGTAGAAATTTATT
+CACGGTACTGT : 0:T:G 1:A:G 2:C:T 3:A:G 4:T:G 5:G:A 6:C:A 7:C:A 9:T:G 1
+0:A:G 11:T:C 12:C:G 13:C:T 14:A:G 15:G:T 17:C:T 19:A:G 20:C:A 21:G:A
+22:C:A 23:A:T 27:G:T 28:G:T 32:T:G 34:C:T 35:G:A 36:T:C 37:G:T 38:T:G
+ 39:A:T
TACATGCCCTATCCAGGCAACGCATTAGGCACTGCGTGTA GATAGTTCCAAGCCACCTCCCACCCTGCT
+CAACATCGCTA : 0:T:G 2:C:T 4:T:G 5:G:T 6:C:T 9:T:A 11:T:G 15:G:C 16:G:
+C 17:C:T 18:A:C 19:A:C 21:G:A 23:A:C 24:T:C 26:A:G 27:G:C 28:G:T 31:C
+:A 32:T:C 33:G:A 34:C:T 35:G:C 36:T:G 37:G:C
TACATGCCCTATCCAGGCAACGCATTAGGCACTGCGTGTA AAGGTTGGTACTTACTAACCGGTTTGAGC
+CGTTCAACGAA : 0:T:A 2:C:G 3:A:G 5:G:T 6:C:G 7:C:G 8:C:T 9:T:A 10:A:C
+12:C:T 13:C:A 14:A:C 15:G:T 16:G:A 17:C:A 18:A:C 19:A:C 20:C:G 22:C:T
+ 23:A:T 25:T:G 28:G:C 30:A:G 31:C:T 33:G:C 34:C:A 35:G:A 36:T:C 38:T:
+A
ACAATGCGACGGCTCCAGACTTGATACCTGTATCAAACTA GGTGGAAACGGCGTGTGTAGAAATTTATT
+CACGGTACTGT : 0:A:G 1:C:G 2:A:T 3:A:G 4:T:G 5:G:A 6:C:A 7:G:A 8:A:C 9
+:C:G 11:G:C 12:C:G 14:C:G 15:C:T 16:A:G 17:G:T 19:C:G 20:T:A 21:T:A 2
+2:G:A 23:A:T 25:A:T 26:C:A 27:C:T 29:G:C 30:T:A 31:A:C 32:T:G 33:C:G
+34:A:T 36:A:C 37:C:T 38:T:G 39:A:T
ACAATGCGACGGCTCCAGACTTGATACCTGTATCAAACTA GATAGTTCCAAGCCACCTCCCACCCTGCT
+CAACATCGCTA : 0:A:G 1:C:A 2:A:T 4:T:G 5:G:T 6:C:T 7:G:C 8:A:C 9:C:A 1
+0:G:A 13:T:C 14:C:A 16:A:C 17:G:T 18:A:C 20:T:C 21:T:A 22:G:C 23:A:C
+24:T:C 25:A:T 26:C:G 29:G:C 30:T:A 32:T:C 33:C:A 34:A:T 35:A:C 36:A:G
ACAATGCGACGGCTCCAGACTTGATACCTGTATCAAACTA AAGGTTGGTACTTACTAACCGGTTTGAGC
+CGTTCAACGAA : 1:C:A 2:A:G 3:A:G 5:G:T 6:C:G 8:A:T 9:C:A 10:G:C 11:G:T
+ 12:C:T 13:T:A 15:C:T 17:G:A 18:A:C 20:T:G 21:T:G 22:G:T 23:A:T 25:A:
+G 26:C:A 27:C:G 28:T:C 29:G:C 30:T:G 31:A:T 36:A:C 37:C:G 38:T:A
GGTGGAAACGGCGTGTGTAGAAATTTATTCACGGTACTGT GATAGTTCCAAGCCACCTCCCACCCTGCT
+CAACATCGCTA : 1:G:A 3:G:A 5:A:T 6:A:T 7:A:C 9:G:A 10:G:A 11:C:G 12:G:
+C 13:T:C 14:G:A 15:T:C 16:G:C 18:A:C 19:G:C 20:A:C 22:A:C 23:T:C 24:T
+:C 26:A:G 27:T:C 31:C:A 32:G:C 33:G:A 35:A:C 36:C:G 37:T:C 38:G:T 39:
+T:A
GGTGGAAACGGCGTGTGTAGAAATTTATTCACGGTACTGT AAGGTTGGTACTTACTAACCGGTTTGAGC
+CGTTCAACGAA : 0:G:A 1:G:A 2:T:G 4:G:T 5:A:T 6:A:G 7:A:G 8:C:T 9:G:A 1
+0:G:C 11:C:T 12:G:T 13:T:A 14:G:C 16:G:A 17:T:A 18:A:C 19:G:C 20:A:G
+21:A:G 22:A:T 25:T:G 27:T:G 28:T:C 30:A:G 31:C:T 32:G:T 33:G:C 34:T:A
+ 37:T:G 38:G:A 39:T:A
GATAGTTCCAAGCCACCTCCCACCCTGCTCAACATCGCTA AAGGTTGGTACTTACTAACCGGTTTGAGC
+CGTTCAACGAA : 0:G:A 2:T:G 3:A:G 4:G:T 6:T:G 7:C:G 8:C:T 10:A:C 11:G:T
+ 12:C:T 13:C:A 14:A:C 15:C:T 16:C:A 17:T:A 20:C:G 21:A:G 22:C:T 23:C:
+T 24:C:T 25:T:G 26:G:A 27:C:G 28:T:C 30:A:G 31:A:T 32:C:T 33:A:C 34:T
+:A 35:C:A 36:G:C 37:C:G 38:T:A
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
| [reply] [d/l] [select] |
|
|
|
|
|
Re: mismatching characters in dna sequence
by TJPride (Pilgrim) on Dec 30, 2011 at 04:11 UTC
|
Runs fine for me with 5 million characters on each string - only takes like 3 seconds and this is a really old comp.
use strict;
use warnings;
my ($s1, $s2, $l, $i, $c1, $c2, @m);
$s1 = 'ATACCGGC'; $s1 .= 'ATTTT'x1000000;
$s2 = 'ATTCCGGG'; $s2 .= 'ATTTT'x1000000;
for $i (0..(length($s1)-1)) {
$c1 = substr($s1, $i, 1);
$c2 = substr($s2, $i, 1);
push @m, [$i, $c1, $c2]
if $c1 ne $c2;
}
print (($#m+1) . ' mismatches with target at position(s) ' . join(', '
+, map { $_->[0] } @m) . ' (' . join(', ', map { $_->[2].'->'.$_->[1]
+} @m) . ')');
| [reply] [d/l] |
|
|
When comparing many strings against many strings (not just two), performance actually might matter.
On my machine, your substr method takes 2.65 secs, while the XOR method I suggested above takes only 0.12 secs (for the same data), i.e. it's roughly 20 times as fast.
| [reply] |
|
|
eliya -- is your transliteration correct? the results differ from the other code snippets already posted.
| [reply] |
|
|
|
|
|
Re: mismatching characters in dna sequence
by tobyink (Canon) on Dec 31, 2011 at 20:41 UTC
|
Personally I'd drop the A T C G approach altogether. As I understand it, with DNA a "word" is always three characters long. So "AACTCGATG" is really three words "AAC-TCG-ATG". There are 64 (i.e. 4^3) possible words, so each three letter word can be represented by a single byte in a binary string (with two bits left over).
Establish a mapping along the lines of AAA=chr(0), AAC=chr(1), AAG=chr(2), ..., GGG=chr(63). Then convert your "AACTCGATG" strings to binary strings when they are first input - keep them as binary strings everywhere internal within your program and just reformat them back to letters when you need to display them.
Your strings will be a third of the size of the original input strings, making string comparisons that much faster. (And yes, I'd use XOR to handle the comparisons - it's likely to be the fastest method.) Of course, if you've got millions of strings that need comparing, the initial import, converting from DNA letters into binary strings will add some overhead to your program, but because the number of possible comparisons scales exponentially with regard to the number of strings, the overhead is likely to be worth it.
I am assuming here that you're working on real DNA data, and not fake data made up for a programming assignment. If it's real data, then strings will always be a multiple of three characters, which is kinda important for the techniques I outline above to work. If it is fake data, then it's still doable - you just pad the end to a multiple of three characters, and then use those two spare bits on the last byte of the binary string to indicate how many padding letters there are. This does somewhat complicate implementation though.
| [reply] |
|
|
| [reply] |