in reply to How to get non-redundant DNA sequences from a FASTA file?
Hello supriyoch_2008,
Same approach as choroba, except that I read the data into a hash mapping headers to sequences, and then reverse that hash to get the non-redundant sequences:
#! perl
use strict;
use warnings;

my $fasta = '>gi1 cds
ATG
fun
>gi2 cds
ATG
fun
>gi3 cds
GGG
fun';

my %hdrs;
$hdrs{$1} = $2 while $fasta =~ / > (.+) \s+ cds \s+ (.*) \s+ fun /gx;

print " A. Header & sequences are:\n";
printf ">%s cds\n%s\n", $_, $hdrs{$_} for sort keys %hdrs;

my %seqs;
while (my ($key, $value) = each %hdrs)
{
    push @{$seqs{$value}}, $key;
}

print " B. Only sequences are:\n";
printf "$_\n" for sort keys %seqs;

print " C. Non-redundant sequences are:\n";
printf ">%s cds\n%s\n", ( sort @{$seqs{$_}} )[0], $_ for sort keys %seqs;
Output:
18:47 >perl 1009_SoPW.pl
 A. Header & sequences are:
>gi1 cds
ATG
>gi2 cds
ATG
>gi3 cds
GGG
 B. Only sequences are:
ATG
GGG
 C. Non-redundant sequences are:
>gi1 cds
ATG
>gi3 cds
GGG
18:47 >
Note: on hash reversal, see "How do I look up a hash element by value?" in perlfaq4.
If you really don’t care which header is output when there are redundant sequences, you can just say:
my %seqs = reverse %hdrs;
...
print " C. Non-redundant sequences are:\n";
printf ">%s cds\n%s\n", $seqs{$_}, $_ for sort keys %seqs;
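The reason you have to "not care" with a plain reverse is that several headers share one sequence, and only one of them can survive as the value; which one depends on the hash's internal ordering, which Perl does not guarantee. If you want the short form but a predictable header, something like the following might do. It is only a sketch: the sub name first_header_by_sequence is my own invention for illustration, and the //= operator assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): map each distinct
# sequence to the alphabetically first header that carries it.
sub first_header_by_sequence {
    my ($hdrs) = @_;    # hash ref: header => sequence
    my %seqs;

    # Visit headers in sorted order; //= stores a header only if the
    # sequence has none yet, so later duplicates are ignored.
    $seqs{ $hdrs->{$_} } //= $_ for sort keys %$hdrs;

    return \%seqs;
}

# Same data as in the script above, already parsed into a hash.
my %hdrs = (gi1 => 'ATG', gi2 => 'ATG', gi3 => 'GGG');
my $seqs = first_header_by_sequence(\%hdrs);

printf ">%s cds\n%s\n", $seqs->{$_}, $_ for sort keys %$seqs;
```

With the sample data this prints gi1 for ATG and gi3 for GGG, the same headers as section C of the longer version.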
When your code starts to get too complicated, it’s usually a good idea to step back and look for a simpler approach. Remember, less is more. ;-)
Hope that helps,
Athanasius <°(((>< contra mundum
Iustus alius egestas vitae, eros Piratica,
Re^2: How to get non-redundant DNA sequences from a FASTA file?
by supriyoch_2008 (Monk) on Sep 13, 2014 at 12:54 UTC