Re: Maximal Parsimony Problem
by moritz (Cardinal) on Sep 03, 2008 at 14:05 UTC
|
# this heuristic seems to be faulty
my $min_mm = scalar( uniq(@colbp) ) - 1;
To me this seems perfectly right. If you want to speed up things a bit more (this is only a micro optimization) you can use a hash in the first place:
my %chars;
foreach my $site ( @{$tfbs} ) {
my $bp = substr($site,$pos,1);
$chars{$bp} = 1;
}
# this heuristic seems to be faulty
my $min_mm = keys %chars - 1;
push @mincol, $min_mm;
Instead I think that your sample output is wrong:
# is:
00100000302011000000100
# should be:
00100000301011000000100
If not, could you please explain how to get the 2 there? | [reply] [d/l] [select] |
|
|
| [reply] |
|
|
I still don't quite understand. We have
human: C
chimp: T
mouse: T
rat: C
So which comparisons exactly lead to to a number of 2? With which rules?
As a poor perl programmer I'm not familiar with the phylogenetic tree, so if that's relevant here please explain in which relation these four examples appear in that tree. | [reply] [d/l] |
|
|
Have you changed your example?
A
C
T
G
How does that become 3?
And the column moritz quoted above used to bee:C
T
T
T
| [reply] [d/l] [select] |
Re: Maximal Parsimony Problem
by BioNrd (Monk) on Sep 03, 2008 at 16:40 UTC
|
While I can not give you a direct answer to your problem. I can tell you that there are other options out there that you should take a look at.
If you do want to work on the code yourself bioperl may be a good place to look. I myself have not worked with the tools there, but I get the sense others have done similar things to what you are trying to do. So there maybe something there that is helpful for you.
There are also numerous programs that grappled with Fitch's algorithm (and other algorithms for that matter), that might avoid the trouble of reinventing the wheel from scratch. For some helpful links to programs you might want to check out the Workshop on Molecular Evolution.
Sorry I can't help more than that.
Bio.
----
Even a blind squirrel finds a nut sometimes.
| [reply] |
Re: Maximal Parsimony Problem
by psini (Deacon) on Sep 03, 2008 at 15:26 UTC
|
I think that the idea is that given the tree:
$tree = '(mouse,rat,(human,chimp))'
the values in column 11 should be treated like this: (T,C,(C,T)) and so the value of 2 results from T ne C ne (C,T); whereas for instance column 3 leads to (G,A,(A,A)) and G ne A eq (A,A).
This is just a wild guess but, if I understand the logic right, (T,C,(A,G))=>3 and (T,C,(C,A))=>2. If it is so, the algorithm should not be so simple when more complex trees are involved.
Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."
| [reply] [d/l] |
Re: Maximal Parsimony Problem
by SFLEX (Chaplain) on Sep 03, 2008 at 15:58 UTC
|
| [reply] |
Re: Maximal Parsimony Problem
by psini (Deacon) on Sep 03, 2008 at 16:26 UTC
|
Just a simple question: am I wrong or, from a cladistic point of view, any node of the tree should have exactly zero or two descendant, one of which is a terminal node?
For if this is the case the problem becomes quite a trivial one, but your example doesn't match my hypothesis.
Update: forget the second condition, it's bullshit.
But I think that the first one holds, for you can have ((rat,mouse),(human,chimp)) or ((rat,(mouse,(human,chimp)) or (mouse,(rat,(human,chimp)) but it make no sense (rat,mouse,(human,chimp)), particularly if you are checking the variation score independently on each base. For the total score to make sense you should assume one more common ancestor or the score for base X will not be consistent with that of base Y.
Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."
| [reply] |
|
|
| [reply] |
Re: Maximal Parsimony Problem
by dwm042 (Priest) on Sep 03, 2008 at 19:09 UTC
|
neversaint, the problem here is that you're giving us genetic data and you're not giving us the genetic tree. Fitch's algorithm solves the small parsimony problem. The small parsimony problem requires a genetic tree as input. With the data we have been given in this node, you're asking us to solve the large parsimony problem, which is much much harder, and is also NP complete.
| [reply] |
|
|
| [reply] |
Re: Maximal Parsimony Problem
by psini (Deacon) on Sep 03, 2008 at 20:55 UTC
|
Assuming (as in my previous post) that the phylogenetic tree is a binary tree, the following code should work.
use strict;
use warnings;
use Data::Dumper;
my $sites={'mouse'=>'CGGCGGAAAACTGTCCTCCGTGC',
'rat'=> 'CGACGGAACATTCTCCTCCGCGC',
'human'=>'CGACGGAATATTCCCCTCCGTGC',
'chimp'=>'CGACGGAAGACTCTCCTCCGTGC'};
my $tree = [['mouse','rat'],['human','chimp']];
# Sanity checks
my $len=-1;
treeCheck($tree,$sites,\$len);
my @val;
$val[$_]=parsimony(extractLocus($tree,$sites,$_))->[0] for (0..$len-1)
+;
print join('',@val)."\n";
sub treeCheck {
my ($subTree,$sites,$len)=@_;
if (ref($subTree) eq 'ARRAY') {
die "Not a binary tree" if @$subTree != 2;
treeCheck($subTree->[0],$sites,$len);
treeCheck($subTree->[1],$sites,$len);
} else {
die "Invalid char" if $sites->{$subTree} !~ /^[ATCG]+$/;
if ($$len<0) {
$$len=length($sites->{$subTree});
} else {
die "Invalid length" if length($sites->{$subTree}) != $$len;
}
}
}
sub extractLocus {
my ($subTree,$sites,$index)=@_;
if (ref($subTree) eq 'ARRAY') {
return [extractLocus($subTree->[0],$sites,$index),extractLocus($su
+bTree->[1],$sites,$index)];
} else {
my $base=substr($sites->{$subTree},$index,1);
$base =~ tr/ATCG/1248/;
return $base;
}
}
sub parsimony {
my $locusTree=shift;
if (ref($locusTree) eq 'ARRAY') {
my $left=parsimony($locusTree->[0]);
my $right=parsimony($locusTree->[1]);
my $count=$left->[0]+$right->[0];
$count++ if ($left->[1] & $right->[1])==0;
my $bases=$left->[1] | $right->[1];
return [$count,$bases]
} else {
return[0,$locusTree];
}
}
I have significantly changed the input data structures, so $tree is "really" a tree (an AoA..oA) containing the names of the species and $sites is an hash using the species as key. It works with your original data but changing the structure of the tree gives (as to be expected) different results:
With $tree = [['mouse','rat'],['human','chimp']];
gives 00100000302011000000100
With $tree = ['mouse',['rat',['human','chimp']]];
gives 00100000301011000000100
Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."
| [reply] [d/l] [select] |
Re: Maximal Parsimony Problem
by bioinformatics (Friar) on Sep 03, 2008 at 20:19 UTC
|
So, if you use the fixes from the posts above, you should be able to output a minimum change in each base from the transcription factor binding site. In the top up phase it looks like you are clustering the individual bases to determine the sequence variability and relations in the sequence itself. By recursively clustering these, you should be able to output a second value that serves as a "score" for the sequence variation, which then allows you to cluster again in the top down version to determine the genotypic relation of the various species. A simple way to cluster them would be to use a sub to recursively find the difference of the "values" your getting for each base associated with each species, pushing them into an array, and them using them to create a second value which is then used to determine the larger phylogenetic tree. It would be more efficient to use a matrix to sort and output those values so that the changes would be more fluid and dynamic and use less memory, etc. Bioperl has a couple modules that are already made to determine the distances in the various nodes, so you could integrate these into your program to get the larger phylogenetic output you desire.
Check out Module:Bio::Tree::Node, as well as Module:Bio::Align::DNAStatistics, which if I recall can do something of what your trying to do above. Good luck!
| [reply] |
Re: Maximal Parsimony Problem
by GrandFather (Saint) on Sep 03, 2008 at 21:57 UTC
|
I have no understanding of the problem domain so the following code is not a solution to your actual problem, but is an alternative way of finding changes between adjacent strings in a list and counting those changes by column.
use strict;
use warnings;
my $sites = [
'CGGCGGAAAACTGTCCTCCGTGC', # mouse
'CGACGGAACATTCTCCTCCGCGC', # rat
'CGACGGAATATTCCCCTCCGTGC', # human
'CGACGGAAGACTCTCCTCCGTGC', # chimp
];
my @val = my_parsimony ($sites);
print @val;
sub my_parsimony {
my ($sites) = @_;
my $lastSite = $sites->[0];
my @mincol = (0) x length $lastSite;
for my $nextSite (@$sites) {
my $changes = $lastSite ^ $nextSite;
$lastSite = $nextSite;
next unless $changes =~ s/[^\0]/+/g;
my $pos = 0;
++$mincol[$pos++] while -1 != ($pos = index $changes, '+', $po
+s);
}
return @mincol;
}
Prints:
00100000302012000000200
Perl reduces RSI - it saves typing
| [reply] [d/l] [select] |
Re: Maximal Parsimony Problem
by SFLEX (Chaplain) on Sep 03, 2008 at 16:58 UTC
|
What I think is happening is since uniq() will return a 1 if all in that column is the same, but what i think the bug is. If all in that column is the same you want it to return a zero (scalar( uniq(@colbp) ) - 1;), but the problem with that is if there all different for that column with 4 rows you would want a 4 returned?
edited spelling.. =p | [reply] [d/l] |