sort sequences and keep ID of them

Diane4Luo has asked for the wisdom of the Perl Monks concerning the following question:

Re: sort sequences and keep ID of them by GrandFather (Saint) on May 21, 2011 at 23:33 UTC
What does your data look like, and what do you expect the result to look like? Maybe you can provide a runnable sample that includes the minimum data required to demonstrate the problem? "I know what I mean. Why don't you?" may help you understand what we would like to see. True laziness is hard work
Re: sort sequences and keep ID of them by Anonymous Monk on May 21, 2011 at 23:29 UTC
And how is it not working?
Re^2: sort sequences and keep ID of them by Diane4Luo (Initiate) on May 22, 2011 at 00:27 UTC
I found out I didn't set the path for my file correctly. Thanks for the suggestion. My sort is working now. Great! The next step is to extract the unique sequences from the list and keep the IDs for each unique sequence. Any suggestions on how to do this? Thanks. Here is a sample sequence from a file. There are 120,000 sequences in one file: `>11EZ4_FR1a_3LNSF7V ACCTCTGGCTTCACCTTTACCAACTATGCCATGACCTGGGTCCGCCAGACTCCAGGGAAGGGCCTGGAGTGGCTTTCAG GCATTAGTGGTGGTGGTGATATCATACACTATGCAGACTTCGTGAAGGGCCGGTTCACCGTCTCCAGAGACGATTCTAA GAGCACACTGTTTCTGCAAATGACCGGCCTGAGAGCCGAAGACTCGGCCGTGTATTATTGTGCGAGAAGGCGTGTACGT CAGGGAGGCACCTACTACTACTACATGGACTTCTGGGGCAAAGGGACCACG`
Re^3: sort sequences and keep ID of them "Oh my, oh my, oh my!!!" by baxy77bax (Deacon) on May 22, 2011 at 10:07 UTC
Oh my, oh my, oh my, is this possible!!!! First of all, as GrandFather pointed out earlier, a lot of info is missing, like:
1. Is this supposed to be a FASTA-formatted entry or not?
2. Do all duplicated sequences have the same ID? (I guess not, but you didn't specify...)
3. How would you choose which sequence (seq header) you want to keep and which you want to leave out?
Now, if I'm correct and this is a FASTA entry and the seq headers are not the same, then what I would do is load the sequences into a hash (BioPerl can help you with FASTA entries) such that the seq body is the key and the header is the value. As a result you will automatically get a unique set of sequences...
NOW, why "Oh my, oh my, oh my"? Well, for a large set of strings the above method is not something I would recommend; I would rather try a different strategy. When you posted the question, the first thing that came to my mind was a trie (keyword tree), so I started to search CPAN for a module, but I couldn't find one!!!! So my question to the other monks is: is the parser for the trie data structure stored under some strange name, or is there really no module for it? Furthermore, is somebody working on it already, or should I do it, since lately I do a lot of programming involving suffix trees, tries, suffix arrays and so on?
Cheers, baxy
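For what it's worth, the hash idea above can be sketched in plain Perl without BioPerl. This is only a minimal sketch under some assumptions: the input is FASTA, the ID is the first word after `>`, and sequences that differ only in letter case count as the same. The sub name `group_ids_by_sequence` and the sample records are made up for illustration. One deviation from the description above: each hash value is a list of headers rather than a single header, so every ID of a duplicated sequence is kept instead of only the last one seen.

```perl
use strict;
use warnings;

# Read FASTA records from a filehandle and map each distinct sequence body
# to the list of IDs that carry it. (Hypothetical helper, not from a module.)
sub group_ids_by_sequence {
    my ($fh) = @_;
    my (%ids_for, @order);    # @order keeps sequences in first-seen order
    my ($id, $seq);

    # Store the record collected so far, if there is one.
    my $store = sub {
        return unless defined $id;
        push @order, $seq unless exists $ids_for{$seq};
        push @{ $ids_for{$seq} }, $id;
    };

    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;           # strip newline and trailing whitespace
        next if $line eq '';
        if ($line =~ /^>(\S+)/) {     # header: ID is the first word after '>'
            $store->();
            ($id, $seq) = ($1, '');
        }
        elsif (defined $id) {
            $seq .= uc $line;         # join wrapped lines, fold to upper case
        }
    }
    $store->();                       # flush the last record

    return (\%ids_for, \@order);
}

# Hypothetical sample data: seq1 and seq3 share one sequence.
my $fasta = <<'END';
>seq1
ACGT
ACGT
>seq2
TTGA
>seq3
ACGTACGT
END

# Read from an in-memory string here; a real file would be opened by name.
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
my ($ids_for, $order) = group_ids_by_sequence($fh);
close $fh;

# One line per unique sequence: comma-separated IDs, a tab, the sequence.
for my $seq (@$order) {
    print join(',', @{ $ids_for->{$seq} }), "\t$seq\n";
}
```

With the sample above, seq1 and seq3 end up on the same line because their joined bodies are identical. For 120,000 sequences of a few hundred bases each, the keys amount to a few tens of megabytes of raw sequence data plus hash overhead, so whether a trie is really needed probably depends on how much memory is available.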