kdt2006, please correct if I'm wrong, but I read your last reply to mean that you want the maximal match in each subset of sequences. For example, I added a new sequence to my code and got the following results:
my $seq = [ [ qw(A C G C A T T C A) ],
[ qw(A C T G G A T A C) ],
[ qw(T C A G C C A T C) ],
[ qw(C T G G A T C G C) ] ];
Outputs:
..G.AT... occurs in 0,3
...G....C occurs in 1,2,3
.C......C occurs in 1,2
A.....T.. occurs in 0,1
........C occurs in 1,2,3
...G..... occurs in 1,2,3
A........ occurs in 0,1
....AT... occurs in 0,3
AC....... occurs in 0,1
..G.A.... occurs in 0,3
..G..T... occurs in 0,3
.C.G....C occurs in 1,2
..G...... occurs in 0,3
......T.. occurs in 0,1
.C.G..... occurs in 1,2
.C....T.. occurs in 0,1
AC....T.. occurs in 0,1
.....T... occurs in 0,3
....A.... occurs in 0,3
.C....... occurs in 0,1,2
Notice above how the maximal match between 1 and 2 is .C.G....C, while a submatch of this, ...G....C, is the maximal match between 1, 2, and 3. From what I understand you would want both of these, even though one is a submatch of the other. You would not, however, want ........C or ...G..... because these are submatches of the maximal match for 1,2,3. I still believe this is a better way to find
all possible matches. Perhaps the way to best suit it to your needs would be to implement a filter that finds the maximal match for each sequence subset. The following update to my original code does the trick. (Note that
true and
uniq require List::MoreUtils.
my %maxMatch;
for my $key (keys %$match) {
@{$match->{$key}} = uniq @{$match->{$key}};
my $seqSet = join ",", @{$match->{$key}};
unless ($maxMatch{$seqSet}) { $maxMatch{$seqSet} = $key; next; }
my $old = true { /[^\.]/ } split //, $maxMatch{$seqSet};
my $new = true { /[^\.]/ } split //, $key;
$maxMatch{$seqSet} = $key if $new > $old;
}
for my $key (keys %maxMatch) {
print $maxMatch{$key}, " occurs in $key \n";
}
Prints (for the four sequence set above):
..G.AT... occurs in 0,3
.C.G....C occurs in 1,2
...G....C occurs in 1,2,3
.C....... occurs in 0,1,2
AC....T.. occurs in 0,1
Update: I was flipping through my
Perl in a Nutshell text and discovered that
grep could (and probably
should) be used in place of
true in the code above. I tested both and they work the same, although
grep may run a little faster since it's a built-in command.
Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
Read Where should I post X? if you're not absolutely sure you're posting in the right place.
Please read these before you post! —
Posts may use any of the Perl Monks Approved HTML tags:
- a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
|
For: |
|
Use: |
| & | | & |
| < | | < |
| > | | > |
| [ | | [ |
| ] | | ] |
Link using PerlMonks shortcuts! What shortcuts can I use for linking?
See Writeup Formatting Tips and other pages linked from there for more info.