in reply to pattern finding algorithm

kdt2006, please correct if I'm wrong, but I read your last reply to mean that you want the maximal match in each subset of sequences. For example, I added a new sequence to my code and got the following results:

my $seq = [ [ qw(A C G C A T T C A) ], [ qw(A C T G G A T A C) ], [ qw(T C A G C C A T C) ], [ qw(C T G G A T C G C) ] ];

Outputs:
..G.AT... occurs in 0,3 ...G....C occurs in 1,2,3 .C......C occurs in 1,2 A.....T.. occurs in 0,1 ........C occurs in 1,2,3 ...G..... occurs in 1,2,3 A........ occurs in 0,1 ....AT... occurs in 0,3 AC....... occurs in 0,1 ..G.A.... occurs in 0,3 ..G..T... occurs in 0,3 .C.G....C occurs in 1,2 ..G...... occurs in 0,3 ......T.. occurs in 0,1 .C.G..... occurs in 1,2 .C....T.. occurs in 0,1 AC....T.. occurs in 0,1 .....T... occurs in 0,3 ....A.... occurs in 0,3 .C....... occurs in 0,1,2

Notice above how the maximal match between 1 and 2 is .C.G....C, while a submatch of this, ...G....C, is the maximal match between 1, 2, and 3. From what I understand you would want both of these, even though one is a submatch of the other. You would not, however, want ........C or ...G..... because these are submatches of the maximal match for 1,2,3. I still believe this is a better way to find all possible matches. Perhaps the way to best suit it to your needs would be to implement a filter that finds the maximal match for each sequence subset. The following update to my original code does the trick. (Note that true and uniq require List::MoreUtils.

my %maxMatch; for my $key (keys %$match) { @{$match->{$key}} = uniq @{$match->{$key}}; my $seqSet = join ",", @{$match->{$key}}; unless ($maxMatch{$seqSet}) { $maxMatch{$seqSet} = $key; next; } my $old = true { /[^\.]/ } split //, $maxMatch{$seqSet}; my $new = true { /[^\.]/ } split //, $key; $maxMatch{$seqSet} = $key if $new > $old; } for my $key (keys %maxMatch) { print $maxMatch{$key}, " occurs in $key \n"; }

Prints (for the four sequence set above):
..G.AT... occurs in 0,3 .C.G....C occurs in 1,2 ...G....C occurs in 1,2,3 .C....... occurs in 0,1,2 AC....T.. occurs in 0,1

Update: I was flipping through my Perl in a Nutshell text and discovered that grep could (and probably should) be used in place of true in the code above. I tested both and they work the same, although grep may run a little faster since it's a built-in command.

Replies are listed 'Best First'.
Re^2: pattern finding algorithm
by d0nuts10 (Initiate) on Aug 09, 2007 at 17:53 UTC
    hello dogz, i tried using the messaging but i couldnt see a text box to type in. ok im a little new here but i was wondering if you could help me in find a pattern in some longer codes that deal with both numbers and letters if you can please email me at tottinum1@gmail.com
Re^2: pattern finding algorithm
by kdt2006 (Initiate) on Aug 01, 2007 at 13:02 UTC

    hi dogz07,

    Many thanks, this is what I was looking for.

    Thanks everyone for all the help. Limal, I will take a look at your suggestion.

    Many thanks again