uvnew has asked for the wisdom of the Perl Monks concerning the following question:

Hey guys. I have about 500 DNA sequences. Each one consists of an ID and the sequence letters, with a blank line between sequences. I need to rearrange the sequences by ID number, in ascending order. Every ID has the form ENSP00000xxxxx. So if, for example, my original sequence file is:
>ENSP00000314624
GCACAATGGTAGAGGCAGATCATCC

>ENSP00000347089
ATGGATTGCTGTGCCTCTCGAGGCT

>ENSP00000301587
TGACCCACTTCCGTTACTTGCTGCG
then I would like it to be:
>ENSP00000301587
TGACCCACTTCCGTTACTTGCTGCG

>ENSP00000314624
GCACAATGGTAGAGGCAGATCATCC

>ENSP00000347089
ATGGATTGCTGTGCCTCTCGAGGCT
I would truly appreciate any suggestion.

Cheers,

uv

Re: Sorting strings
by ikegami (Patriarch) on Jan 25, 2007 at 17:33 UTC

    Since the numbers are zero-padded to a fixed width, sorting in plain old lexical order will do the trick.

    my @unsorted = (
        '>ENSP00000314624 GCACAATGGTAGAGGCAGATCATCC',
        '>ENSP00000347089 ATGGATTGCTGTGCCTCTCGAGGCT',
        '>ENSP00000301587 TGACCCACTTCCGTTACTTGCTGCG',
    );
    my @sorted = sort @unsorted;
    print("$_\n") foreach @sorted;

    Ref: sort
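
    A minimal sketch of the caveat behind the plain sort: string order matches numeric order only while every ID has the same width. If the IDs were not zero-padded, one option would be to extract the number and compare with <=>. The helper names id_number and sort_by_id and the unpadded sample IDs below are made up for illustration; they are not from the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the digits out of a header such as ">ENSP00000314624".
# Assumes every record starts with ">", then letters, then the numeric ID.
sub id_number {
    my ($record) = @_;
    my ($digits) = $record =~ /^>[A-Za-z]+(\d+)/
        or die "No ID found in record: $record";
    return $digits;
}

# Hypothetical helper: sort records by numeric ID. Each ID is extracted
# only once (a Schwartzian transform) and compared with <=>.
sub sort_by_id {
    return map  { $_->[1] }
           sort { $a->[0] <=> $b->[0] }
           map  { [ id_number($_), $_ ] } @_;
}

# Made-up records whose IDs are NOT zero-padded.
my @records = (
    ">ENSP314624\nGCACAATGGTAGAGGCAGATCATCC\n",
    ">ENSP1587\nTGACCCACTTCCGTTACTTGCTGCG\n",
    ">ENSP47089\nATGGATTGCTGTGCCTCTCGAGGCT\n",
);

my @sorted = sort_by_id(@records);
print @sorted;
```

    With these sample records the numeric sort yields the IDs in the order 1587, 47089, 314624, whereas a plain sort would put 314624 before 47089. For the zero-padded IDs in the question, both approaches give the same order.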

    Update: My post was written when the OP wasn't formatted, so it wasn't evident that there were line breaks in the data. The answer is still the same, however. The key is to create an array of records instead of an array of lines. mreece shows one way of doing this.

Re: Sorting strings
by shigetsu (Hermit) on Jan 25, 2007 at 17:36 UTC
    As a side note: Have you considered using bioperl?
Re: Sorting strings
by mreece (Friar) on Jan 26, 2007 at 02:26 UTC
    You can set the input record separator ($/) to "\n\n" to read the paragraph format you have there. Then the default sort will work, as long as the ID prefix is the same on every record.
    #!perl
    use strict;
    use warnings;

    my @data;
    {
        local $/ = "\n\n";
        @data = <DATA>;
    }
    my @sorted = sort @data;
    print @sorted;

    __DATA__
    >ENSP00000314624
    GCACAATGGTAGAGGCAGATCATCC

    >ENSP00000347089
    ATGGATTGCTGTGCCTCTCGAGGCT

    >ENSP00000301587
    TGACCCACTTCCGTTACTTGCTGCG
    produces
    >ENSP00000301587
    TGACCCACTTCCGTTACTTGCTGCG
    
    >ENSP00000314624
    GCACAATGGTAGAGGCAGATCATCC
    
    >ENSP00000347089
    ATGGATTGCTGTGCCTCTCGAGGCT
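
    A variation on the same idea, offered as a sketch rather than a drop-in replacement: setting $/ to the empty string puts Perl in paragraph mode, where any run of blank lines ends a record, and chomp then strips all trailing newlines. That makes the output spacing independent of how many blank lines separate the records or whether the file ends with one. The helper name read_sorted_records and the in-memory sample input are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: read blank-line-separated records from a filehandle
# and return them in sorted order, with trailing newlines removed.
sub read_sorted_records {
    my ($fh) = @_;
    local $/ = '';        # paragraph mode: one or more blank lines end a record
    my @records = <$fh>;
    chomp @records;       # in paragraph mode, chomp removes all trailing newlines
    return sort @records;
}

# An in-memory filehandle stands in for the real sequence file.
# Note the uneven spacing and the missing blank line at the end.
my $input = ">ENSP00000314624\nGCACAATGGTAGAGGCAGATCATCC\n\n\n"
          . ">ENSP00000347089\nATGGATTGCTGTGCCTCTCGAGGCT\n\n"
          . ">ENSP00000301587\nTGACCCACTTCCGTTACTTGCTGCG\n";

open my $fh, '<', \$input or die "Cannot open in-memory file: $!";
my @sorted = read_sorted_records($fh);
close $fh;

# Re-join with exactly one blank line between records.
print join("\n\n", @sorted), "\n";
```

    For a real file, the in-memory handle would be replaced by an ordinary open on the file name.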
    
    
      Thanks a lot, that works perfectly! Can you recommend a good tutorial for regular expressions? I think I really need that... Cheers, uv