in reply to How to get non-redundant DNA sequences from a FASTA file?

Hello supriyoch_2008,

Same approach as choroba, except that I read the data into a hash mapping headers to sequences, and then reverse that hash to get the non-redundant sequences:

#! perl use strict; use warnings; my $fasta = '>gi1 cds ATG fun >gi2 cds ATG fun >gi3 cds GGG fun'; my %hdrs; $hdrs{$1} = $2 while $fasta =~ / > (.+) \s+ cds \s+ (.*) \s+ fun /g +x; print " A. Header & sequences are:\n"; printf ">%s cds\n%s\n", $_, $hdrs{$_} for sort keys %hdrs; my %seqs; while (my ($key, $value) = each %hdrs) { push @{$seqs{$value}}, $key; } print " B. Only sequences are:\n"; printf "$_\n" for sort keys %seqs; print " C. Non-redundant sequences are:\n"; printf ">%s cds\n%s\n", ( sort @{$seqs{$_}} )[0], $_ for sort keys %se +qs;

Output:

18:47 >perl 1009_SoPW.pl A. Header & sequences are: >gi1 cds ATG >gi2 cds ATG >gi3 cds GGG B. Only sequences are: ATG GGG C. Non-redundant sequences are: >gi1 cds ATG >gi3 cds GGG 18:47 >

Note: on hash reversal, see How-do-I-look-up-a-hash-element-by-value of perlfaq4.

If you really don’t care which header is output when there are redundant sequences, you can just say:

my %seqs = reverse %hdrs; ... print " C. Non-redundant sequences are:\n"; printf ">%s cds\n%s\n", $seqs{$_}, $_ for sort keys %seqs;

When your code starts to get too complicated, it’s usually a good idea to step back and look for a simpler approach. Remember, less is more. ;-)

Hope that helps,

Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Replies are listed 'Best First'.
Re^2: How to get non-redundant DNA sequences from a FASTA file?
by supriyoch_2008 (Monk) on Sep 13, 2014 at 12:54 UTC

    Hi Athanasius,

    Thank you for your suggestions and fixing the problem. Your script has worked nicely. I am grateful to you.

    With regards,