is this possible!!!!
First of all, as grandfather pointed out earlier, there is a lot of info missing, like:
1. Is this supposed to be a FASTA-formatted entry or not?
2. Do all duplicated sequences have the same ID? (I guess not, but you didn't specify...)
3. How would you choose which sequence (seq header) you want to keep and which you want to leave out?
Now, if I'm correct and this is a FASTA entry and the seq headers are not the same, then what I would do is load the sequences into a hash (BioPerl can help you with FASTA entries) such that the seq body is the key and the header is the value. Since hash keys are unique, you will automatically get a unique set of sequences...
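A rough sketch of that hash idea in plain Perl, without BioPerl, just to show the shape of it. The sub name `unique_sequences` and the rule "keep the first header seen for each sequence" are my assumptions, since the original question doesn't say which header should win:

```perl
use strict;
use warnings;

# Hypothetical helper: read FASTA records from a filehandle and return a
# hash reference mapping sequence body => header.
sub unique_sequences {
    my ($fh) = @_;
    my (%header_for, $header, $seq);

    # Store the record collected so far, if there is one.
    my $store = sub {
        return unless defined $header;
        # Assumption: the first header seen for a sequence is the one kept.
        $header_for{$seq} //= $header;
    };

    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/\r\z//;              # tolerate Windows line endings
        next if $line eq '';
        if ($line =~ /^>(.*)/) {        # header line starts a new record
            $store->();
            ($header, $seq) = ($1, '');
        }
        elsif (defined $header) {       # sequence line, possibly wrapped
            $seq .= $line;
        }
    }
    $store->();                         # do not forget the last record
    return \%header_for;
}

# Usage with an in-memory filehandle; id1 and id3 carry the same sequence.
my $fasta = ">id1\nACGT\nAC\n>id2\nTTTT\n>id3\nACGTAC\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
my $unique = unique_sequences($fh);
close $fh;

print ">$unique->{$_}\n$_\n" for sort keys %$unique;
```

If you would rather keep the last header, or collect all headers per sequence, only the line inside `$store` needs to change.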
NOW, why "Oh my, oh my, oh my"....
Well, for a large set of strings the above method is not something I would recommend; I would rather try a different strategy. So when you posted the question, the first thing that came to my mind was a trie (keyword tree), so I started to search CPAN for a module, but I couldn't find one!!!!
So my question to other monks is: is the module for the trie data structure stored under some strange name, or is there really no module for it? Furthermore, is somebody working on it already, or should I do it, since lately I do a lot of programming involving suffix trees, tries, suffix arrays and so on?
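For anyone who hasn't met the structure, here is a minimal sketch of what I mean by a trie, built from nested hashes. The names `trie_insert` and `trie_ids` are made up for illustration and do not come from any CPAN module. Each character is one level of the tree, so sequences that share a prefix share those nodes, and the empty-string key marks the end of a stored sequence and holds its IDs:

```perl
use strict;
use warnings;

# Insert a sequence into the trie, remembering its ID at the end node.
# The key '' cannot collide with a real character, so it is used as the
# end-of-sequence marker (a choice made for this sketch).
sub trie_insert {
    my ($root, $seq, $id) = @_;
    my $node = $root;
    for my $char (split //, $seq) {
        $node->{$char} //= {};          # create the child node on demand
        $node = $node->{$char};
    }
    push @{ $node->{''} }, $id;
    return;
}

# Return the IDs stored for exactly this sequence (empty list if none).
sub trie_ids {
    my ($root, $seq) = @_;
    my $node = $root;
    for my $char (split //, $seq) {
        $node = $node->{$char} or return;   # path does not exist
    }
    return @{ $node->{''} || [] };
}

my %trie;
trie_insert(\%trie, 'ACGT', 'id1');
trie_insert(\%trie, 'ACGA', 'id2');
trie_insert(\%trie, 'ACGT', 'id3');

print join(',', trie_ids(\%trie, 'ACGT')), "\n";   # prints "id1,id3"
```

One hash per character is not cheap in Perl, so treat this purely as an illustration of the idea; a serious module would need a more compact node representation.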
Cheers
baxy