Let's explore the overall efficiency. There are nine positions and four different letters (A,C,G,T). This makes 4^9 = 262,144 unique permutations. Now examine the form of a pattern. Each position in the pattern can either be a match (which you've represented above as the actual matching letter) or a no match (which you've represented with a full stop. So every pattern can have two options (match or no match) for each of the nine positions, making only 2^9-1 = 511 unique permutations for patterns (subtract one for the case of no matches at all). Taking advantage of this fact may significantly reduce the number of comparisons necessary to find all possible matches.

To generalize this method, let's assume that there are n sequences of m positions. Checking each pattern against all of the sequences would require (2^m-1)*sum(1..n-1) comparisons. Below is my solution to the problem using this method.

use strict; use List::MoreUtils qw(uniq); # initialize sequence data and emtpy match set my $seq = [ [ qw(A C G C A T T C A) ], [ qw(A C T G G A T A C) ], [ qw(T C A G C C A T C) ] ]; my $n = scalar(@$seq); # num of sequences my $m = scalar(@{$seq->[0]}); # num of positions my $match = {}; # first iterate over all the possible patterns for the sequence data for my $pat (1 .. 2**$m-1) { # now create the pattern by converting the pattern number to # binary and then to a slice with which to compare sequences. # pretty slick, if you ask me. my @pattern = reverse split //, binary($pat); my @slice; for (0 .. $m-1) { push @slice, $_ if $pattern[$_] } # now iterate over every sequence. check to see if it matches any # of the others based on the given pattern. for my $i (0 .. $n-2) { for my $j ( $i+1 .. $n-1 ) { if ( join("", @{$seq->[$i]}[@slice]) eq join("", @{$seq->[$j]}[@slice]) ) { # construct a hash key and store the matches my $key = join "", map { $pattern[$_] ? $seq->[$i][$_] : "." } (0 .. $m-1); push @{$match->{$key}}, $i,$j ; } } } } # now strip out the duplicate sequence numbers for each match (an # unfortunate step I had trouble avoiding), then display the matches for my $key (keys %$match) { @{$match->{$key}} = uniq @{$match->{$key}}; print $key , " occurs in " , join(",", @{$match->{$key}}) , "\n"; } # a short decimal to binary converter I borrowed from Mark Dominus sub binary { my ($n) = @_; return $n if $n == 0 || $n == 1; my $k = int($n/2); my $b = $n % 2; my $E = binary($k); return $E . $b; }

Prints the following matches (note there are more than previously thought):
...G....C occurs in 1,2 .C......C occurs in 1,2 A.....T.. occurs in 0,1 .C.G....C occurs in 1,2 ........C occurs in 1,2 ...G..... occurs in 1,2 ......T.. occurs in 0,1 .C.G..... occurs in 1,2 A........ occurs in 0,1 AC....... occurs in 0,1 .C....T.. occurs in 0,1 AC....T.. occurs in 0,1 .C....... occurs in 0,1,2

In reply to Re: pattern finding algorithm by dogz007
in thread pattern finding algorithm by kdt2006

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.