Diane4Luo has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to count the unique sequences in my data and keep the ID for each sequence. I used Perl's sort (quick sort) function, but it doesn't work. Here is my code:
foreach $value (sort {$SeqSet{$a} cmp $SeqSet{$b}} keys %SeqSet) { print "$value $SeqSet{$value}\n"; }
Thanks a lot! Diane

Re: sort sequences and keep ID of them
by GrandFather (Saint) on May 21, 2011 at 23:33 UTC

    What does your data look like, and what do you expect the result to look like? Maybe you can provide a runnable sample that includes the minimum data required to demonstrate the problem? "I know what I mean. Why don't you?" may help you understand what we would like to see.

    True laziness is hard work
Re: sort sequences and keep ID of them
by Anonymous Monk on May 21, 2011 at 23:29 UTC
    And how is it not working?

      I found out I didn't set the path for my file correctly. Thanks for the suggestion. My sort is working now. Great! The next step is to extract the unique sequences from the list and keep the IDs for each unique sequence. Any suggestions on how to do this? Thanks. Here is a sample sequence from a file. There are 120,000 sequences in one file:

      >11EZ4_FR1a_3LNSF7V
      ACCTCTGGCTTCACCTTTACCAACTATGCCATGACCTGGGTCCGCCAGACTCCAGGGAAGGGCCTGGAGTGGCTTTCAG
      GCATTAGTGGTGGTGGTGATATCATACACTATGCAGACTTCGTGAAGGGCCGGTTCACCGTCTCCAGAGACGATTCTAA
      GAGCACACTGTTTCTGCAAATGACCGGCCTGAGAGCCGAAGACTCGGCCGTGTATTATTGTGCGAGAAGGCGTGTACGT
      CAGGGAGGCACCTACTACTACTACATGGACTTCTGGGGCAAAGGGACCACG
        Oh my, oh my, oh my,

        is this possible!!!!
        First of all, as GrandFather pointed out earlier, a lot of info is missing, like:

        1. Is this supposed to be a FASTA-formatted entry or not?
        2. Do all duplicated sequences have the same ID? (I guess not, but you didn't specify...)
        3. How would you choose which sequence (seq header) you want to keep and which you want to leave out?

        Now, if I'm correct and this is a FASTA entry and the seq headers are not the same, then what I would do is load the sequences into a hash (BioPerl can help you with FASTA entries) such that the seq body is the key and the header is the value. As a result you will automatically get a unique set of sequences...
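A minimal sketch of the hash approach described above, in plain Perl without BioPerl. It assumes the input is FASTA with unique header IDs, and it stores an array of IDs per sequence (rather than a single header) so that no ID is lost when sequences repeat. The sample records and the names `%seq_of` and `%ids_for` are made up for illustration; a real script would read from the sequence file instead of an in-memory string.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the real FASTA file.
my $fasta = <<'END';
>id1
ACGT
ACGT
>id2
TTTT
>id3
ACGTACGT
END

open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";

# Pass 1: join the (possibly multi-line) sequence body for each header ID.
my (%seq_of, $current_id);
while (my $line = <$fh>) {
    chomp $line;
    if ($line =~ /^>(\S+)/) {      # header line: remember the ID
        $current_id = $1;
        next;
    }
    next unless defined $current_id;
    $line =~ s/\s+//g;             # drop stray whitespace
    $seq_of{$current_id} .= $line;
}
close $fh;

# Pass 2: invert the mapping, sequence body => list of IDs that carry it.
my %ids_for;
for my $id (sort keys %seq_of) {
    push @{ $ids_for{ $seq_of{$id} } }, $id;
}

# Report: number of copies, the IDs, and the unique sequence itself.
for my $seq (sort keys %ids_for) {
    my @ids = @{ $ids_for{$seq} };
    printf "%d\t%s\t%s\n", scalar @ids, join(',', @ids), $seq;
}
```

The number of keys in `%ids_for` is the count of unique sequences, and each value lists every ID that shares that sequence.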

        NOW why "Oh my, oh my, oh my"....

        Well, for a large set of strings the above method is not something I would recommend; I would rather try a different strategy. So when you posted the question, the first thing that came to my mind was a trie (keyword tree), so I started to search CPAN for a module, but I couldn't find one!!!!
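For readers unfamiliar with the idea, a trie can be sketched with nested Perl hashes, one node per character, with the IDs stored at the node where a sequence ends. This is only an illustration under assumptions: the function names `trie_insert` and `trie_ids` and the sample data are hypothetical, and the sketch does not show whether a trie actually uses less memory or time than a plain hash for 120,000 sequences.

```perl
use strict;
use warnings;

my %trie;   # root node; each node is a hash of child nodes keyed by character

# Walk (and create) one node per character, then record the ID at the end node.
sub trie_insert {
    my ($root, $seq, $id) = @_;
    my $node = $root;
    for my $char (split //, $seq) {
        $node = $node->{$char} ||= {};
    }
    push @{ $node->{ids} }, $id;   # 'ids' cannot clash with single-character keys
}

# Return the IDs stored for exactly this sequence (empty list if none).
sub trie_ids {
    my ($root, $seq) = @_;
    my $node = $root;
    for my $char (split //, $seq) {
        $node = $node->{$char} or return;
    }
    return @{ $node->{ids} || [] };
}

# Hypothetical sample data: id1 and id3 carry the same sequence.
trie_insert(\%trie, 'ACGT', 'id1');
trie_insert(\%trie, 'ACGA', 'id2');
trie_insert(\%trie, 'ACGT', 'id3');

print join(',', trie_ids(\%trie, 'ACGT')), "\n";   # prints "id1,id3"
```

Sequences that share a prefix share nodes, so duplicates end at the same node and their IDs collect there.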

        So my question to other monks is: is the module for the trie data structure stored under some strange name, or is there really no module for it? Furthermore, is somebody working on it already, or should I do it, since lately I do a lot of programming involving suffix trees, tries, suffix arrays and so on ....??

        Cheers

        baxy