Maximal Parsimony Problem

neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Maximal Parsimony Problem by moritz (Cardinal) on Sep 03, 2008 at 14:05 UTC
`# this heuristic seems to be faulty my $min_mm = scalar( uniq(@colbp) ) - 1;` [download] To me this seems perfectly right. If you want to speed up things a bit more (this is only a micro optimization) you can use a hash in the first place: `my %chars; foreach my $site ( @{$tfbs} ) { my $bp = substr($site,$pos,1); $chars{$bp} = 1; } # this heuristic seems to be faulty my $min_mm = keys %chars - 1; push @mincol, $min_mm;` [download] Instead I think that your sample output is wrong: `# is: 00100000302011000000100 # should be: 00100000301011000000100` [download] If not, could you please explain how to get the 2 there?	[reply] [d/l] [select]
Re^2: Maximal Parsimony Problem by neversaint (Deacon) on Sep 03, 2008 at 14:28 UTC
Hi Moritz, It is based on phylogenetic tree, the substitution comparison from lower nodes must be compared to higher level. E.g. For column 11, the bases for row 3 is compared to row 1. Same result will also be given if we compare row 3 to row 2. Both row 1 and row 2 has higher level than row 3 row 4 in phylogenetic tree. --- neversaint and everlastingly indebted.......	[reply]
Re^3: Maximal Parsimony Problem by moritz (Cardinal) on Sep 03, 2008 at 14:36 UTC
I still don't quite understand. We have `human: C chimp: T mouse: T rat: C` [download] So which comparisons exactly lead to to a number of 2? With which rules? As a poor perl programmer I'm not familiar with the phylogenetic tree, so if that's relevant here please explain in which relation these four examples appear in that tree.	[reply] [d/l]
Re^3: Maximal Parsimony Problem by lidden (Curate) on Sep 03, 2008 at 15:29 UTC
Have you changed your example? `A C T G` [download] How does that become 3? And the column moritz quoted above used to bee: `C T T T` [download]	[reply] [d/l] [select]
Re: Maximal Parsimony Problem by BioNrd (Monk) on Sep 03, 2008 at 16:40 UTC
While I can not give you a direct answer to your problem. I can tell you that there are other options out there that you should take a look at. If you do want to work on the code yourself bioperl may be a good place to look. I myself have not worked with the tools there, but I get the sense others have done similar things to what you are trying to do. So there maybe something there that is helpful for you. There are also numerous programs that grappled with Fitch's algorithm (and other algorithms for that matter), that might avoid the trouble of reinventing the wheel from scratch. For some helpful links to programs you might want to check out the Workshop on Molecular Evolution. Sorry I can't help more than that. Bio. ---- Even a blind squirrel finds a nut sometimes.	[reply]
Re: Maximal Parsimony Problem by psini (Deacon) on Sep 03, 2008 at 15:26 UTC
I think that the idea is that given the tree: `$tree = '(mouse,rat,(human,chimp))'` the values in column 11 should be treated like this: (T,C,(C,T)) and so the value of 2 results from T ne C ne (C,T); whereas for instance column 3 leads to (G,A,(A,A)) and G ne A eq (A,A). This is just a wild guess but, if I understand the logic right, (T,C,(A,G))=>3 and (T,C,(C,A))=>2. If it is so, the algorithm should not be so simple when more complex trees are involved. Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."	[reply] [d/l]
Re: Maximal Parsimony Problem by SFLEX (Chaplain) on Sep 03, 2008 at 15:58 UTC
Are these topics close to what your asking? Walter M. Fitch , Phylogenetic tree, distance matrix From what i gather its fairly complex.	[reply]
Re: Maximal Parsimony Problem by psini (Deacon) on Sep 03, 2008 at 16:26 UTC
Just a simple question: am I wrong or, from a cladistic point of view, any node of the tree should have exactly zero or two descendant~~, one of which is a terminal node~~? For if this is the case the problem becomes quite a trivial one, but your example doesn't match my hypothesis. Update: forget the second condition, it's bullshit. But I think that the first one holds, for you can have ((rat,mouse),(human,chimp)) or ((rat,(mouse,(human,chimp)) or (mouse,(rat,(human,chimp)) but it make no sense (rat,mouse,(human,chimp)), particularly if you are checking the variation score independently on each base. For the total score to make sense you should assume one more common ancestor or the score for base X will not be consistent with that of base Y. Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."	[reply]
Re^2: Maximal Parsimony Problem by neversaint (Deacon) on Sep 04, 2008 at 03:55 UTC
Dear psini, Thank you so much for your thoughtful replies. You are right, that the phylogenetic tree is always a binary tree. I guess the problem I gave you is the tree "representation". Newick's format gives a general-ordered tree. And it seems that we always can convert this general-ordered tree into a binary tree. --- neversaint and everlastingly indebted.......	[reply]
Re: Maximal Parsimony Problem by dwm042 (Priest) on Sep 03, 2008 at 19:09 UTC
neversaint, the problem here is that you're giving us genetic data and you're not giving us the genetic tree. Fitch's algorithm solves the small parsimony problem. The small parsimony problem requires a genetic tree as input. With the data we have been given in this node, you're asking us to solve the large parsimony problem, which is much much harder, and is also NP complete.	[reply]
Re^2: Maximal Parsimony Problem by neversaint (Deacon) on Sep 04, 2008 at 01:45 UTC
Dear dwm042, You are right. It is a small parsimony problem. BTW, from the start I have already included the "genetic tree input" in my sample code. --- neversaint and everlastingly indebted.......	[reply]
Re: Maximal Parsimony Problem by psini (Deacon) on Sep 03, 2008 at 20:55 UTC
Assuming (as in my previous post) that the phylogenetic tree is a binary tree, the following code should work. use strict; use warnings; use Data::Dumper; my $sites={'mouse'=>'CGGCGGAAAACTGTCCTCCGTGC', 'rat'=> 'CGACGGAACATTCTCCTCCGCGC', 'human'=>'CGACGGAATATTCCCCTCCGTGC', 'chimp'=>'CGACGGAAGACTCTCCTCCGTGC'}; my $tree = [['mouse','rat'],['human','chimp']]; # Sanity checks my $len=-1; treeCheck($tree,$sites,\$len); my @val; $val[$_]=parsimony(extractLocus($tree,$sites,$_))->[0] for (0..$len-1) +; print join('',@val)."\n"; sub treeCheck { my ($subTree,$sites,$len)=@_; if (ref($subTree) eq 'ARRAY') { die "Not a binary tree" if @$subTree != 2; treeCheck($subTree->[0],$sites,$len); treeCheck($subTree->[1],$sites,$len); } else { die "Invalid char" if $sites->{$subTree} !~ /^[ATCG]+$/; if ($$len<0) { $$len=length($sites->{$subTree}); } else { die "Invalid length" if length($sites->{$subTree}) != $$len; } } } sub extractLocus { my ($subTree,$sites,$index)=@_; if (ref($subTree) eq 'ARRAY') { return [extractLocus($subTree->[0],$sites,$index),extractLocus($su +bTree->[1],$sites,$index)]; } else { my $base=substr($sites->{$subTree},$index,1); $base =~ tr/ATCG/1248/; return $base; } } sub parsimony { my $locusTree=shift; if (ref($locusTree) eq 'ARRAY') { my $left=parsimony($locusTree->[0]); my $right=parsimony($locusTree->[1]); my $count=$left->[0]+$right->[0]; $count++ if ($left->[1] & $right->[1])==0; my $bases=$left->[1] \| $right->[1]; return [$count,$bases] } else { return[0,$locusTree]; } } [download] I have significantly changed the input data structures, so $tree is "really" a tree (an AoA..oA) containing the names of the species and $sites is an hash using the species as key. It works with your original data but changing the structure of the tree gives (as to be expected) different results: `With $tree = [['mouse','rat'],['human','chimp']]; gives 00100000302011000000100 With $tree = ['mouse',['rat',['human','chimp']]]; gives 00100000301011000000100` [download] Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."	[reply] [d/l] [select]
Re: Maximal Parsimony Problem by bioinformatics (Friar) on Sep 03, 2008 at 20:19 UTC
So, if you use the fixes from the posts above, you should be able to output a minimum change in each base from the transcription factor binding site. In the top up phase it looks like you are clustering the individual bases to determine the sequence variability and relations in the sequence itself. By recursively clustering these, you should be able to output a second value that serves as a "score" for the sequence variation, which then allows you to cluster again in the top down version to determine the genotypic relation of the various species. A simple way to cluster them would be to use a sub to recursively find the difference of the "values" your getting for each base associated with each species, pushing them into an array, and them using them to create a second value which is then used to determine the larger phylogenetic tree. It would be more efficient to use a matrix to sort and output those values so that the changes would be more fluid and dynamic and use less memory, etc. Bioperl has a couple modules that are already made to determine the distances in the various nodes, so you could integrate these into your program to get the larger phylogenetic output you desire. Check out Module:Bio::Tree::Node, as well as Module:Bio::Align::DNAStatistics, which if I recall can do something of what your trying to do above. Good luck! Bioinformatics	[reply]
Re: Maximal Parsimony Problem by GrandFather (Saint) on Sep 03, 2008 at 21:57 UTC
I have no understanding of the problem domain so the following code is not a solution to your actual problem, but is an alternative way of finding changes between adjacent strings in a list and counting those changes by column. use strict; use warnings; my $sites = [ 'CGGCGGAAAACTGTCCTCCGTGC', # mouse 'CGACGGAACATTCTCCTCCGCGC', # rat 'CGACGGAATATTCCCCTCCGTGC', # human 'CGACGGAAGACTCTCCTCCGTGC', # chimp ]; my @val = my_parsimony ($sites); print @val; sub my_parsimony { my ($sites) = @_; my $lastSite = $sites->[0]; my @mincol = (0) x length $lastSite; for my $nextSite (@$sites) { my $changes = $lastSite ^ $nextSite; $lastSite = $nextSite; next unless $changes =~ s/[^\0]/+/g; my $pos = 0; ++$mincol[$pos++] while -1 != ($pos = index $changes, '+', $po +s); } return @mincol; } [download] Prints: `00100000302012000000200` [download] Perl reduces RSI - it saves typing	[reply] [d/l] [select]
Re: Maximal Parsimony Problem by SFLEX (Chaplain) on Sep 03, 2008 at 16:58 UTC
What I think is happening is since uniq() will return a 1 if all in that column is the same, but what i think the bug is. If all in that column is the same you want it to return a zero (`scalar( uniq(@colbp) ) - 1;`), but the problem with that is if there all different for that column with 4 rows you would want a 4 returned? edited spelling.. =p	[reply] [d/l]