in reply to pattern finding algorithm
If you really want to look for individual common characters between all possible pairs of 2000 sequences, then you just have to do the work, and there is a lot of work to do: 2000 × 1999 / 2 = 1,999,000 pairs!
It may help to tell us why you want to do that. There may be a better solution to your problem than the brute force search implied so far.
However, you may find this code interesting:
use strict;
use warnings;

my @sequences = qw(ACGCATTCA ACTGGATAC TCAGCCATC);
my %matches;

for my $outer (0 .. $#sequences - 1) {
    for my $inner ($outer + 1 .. $#sequences) {
        my $mask = $sequences[$outer] ^ $sequences[$inner];

        next if index ($mask, "\0") == -1;    # No matching characters

        $mask =~ tr/\0/\xff/c;
        $mask |= $sequences[$outer];
        $mask =~ tr/\xff/./;
        push @{$matches{$mask}}, [$outer + 1, $inner + 1];
    }
}

for my $match (sort keys %matches) {
    print "$match pattern between ",
        join (', ', map {"$_->[0] and $_->[1]"} @{$matches{$match}}), "\n";
}
Prints:
.C....... pattern between 1 and 3
.C.G....C pattern between 2 and 3
AC....T.. pattern between 1 and 2
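The masking trick is terse, so here is the same per-pair logic pulled out into a sub with each step commented. This is only a sketch: match_pattern is a hypothetical name that is not in the code above, and it assumes both sequences are the same length, contain single-byte characters only, and that the bitwise feature is not enabled (so ^ and | act on strings rather than numbers).

```perl
use strict;
use warnings;

# Hypothetical helper: returns the match pattern for two equal-length
# strings, or nothing (undef in scalar context) when no position matches.
sub match_pattern {
    my ($first, $second) = @_;

    # String XOR: every position where the characters agree becomes "\0".
    my $mask = $first ^ $second;
    return if index ($mask, "\0") == -1;

    # Turn every non-NUL byte (a mismatch) into "\xff".
    $mask =~ tr/\0/\xff/c;

    # OR with the first string: "\0" | char gives the char back,
    # while "\xff" | char stays "\xff".
    $mask |= $first;

    # Show the mismatched positions as dots.
    $mask =~ tr/\xff/./;
    return $mask;
}

print match_pattern ('ACGCATTCA', 'ACTGGATAC'), "\n";    # AC....T..
```

All the per-character comparison happens inside Perl's string operators, so there is no explicit character loop. That makes each pair cheap, but it does not reduce the number of pairs.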