Hello supriyoch_2008,
Same approach as choroba, except that I read the data into a hash mapping headers to sequences, and then reverse that hash to get the non-redundant sequences:
#! perl
use strict;
use warnings;

my $fasta = '>gi1 cds ATG fun
>gi2 cds ATG fun
>gi3 cds GGG fun';

my %hdrs;
$hdrs{$1} = $2 while $fasta =~ / > (.+) \s+ cds \s+ (.*) \s+ fun /gx;

print " A. Header & sequences are:\n";
printf ">%s cds\n%s\n", $_, $hdrs{$_} for sort keys %hdrs;

my %seqs;
while (my ($key, $value) = each %hdrs)
{
    push @{$seqs{$value}}, $key;
}

print " B. Only sequences are:\n";
printf "$_\n" for sort keys %seqs;

print " C. Non-redundant sequences are:\n";
printf ">%s cds\n%s\n", ( sort @{$seqs{$_}} )[0], $_ for sort keys %seqs;
Output:
18:47 >perl 1009_SoPW.pl
 A. Header & sequences are:
>gi1 cds
ATG
>gi2 cds
ATG
>gi3 cds
GGG
 B. Only sequences are:
ATG
GGG
 C. Non-redundant sequences are:
>gi1 cds
ATG
>gi3 cds
GGG

18:47 >
Note: on hash reversal, see "How do I look up a hash element by value?" in perlfaq4.
If you really don’t care which header is output when there are redundant sequences, you can just say:
my %seqs = reverse %hdrs;
...
print " C. Non-redundant sequences are:\n";
printf ">%s cds\n%s\n", $seqs{$_}, $_ for sort keys %seqs;
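If you do care which header survives but still want something nearly as short, here is a minimal sketch. It assumes Perl 5.10 or later (for the `//=` operator), and the hard-coded `%hdrs` is only stand-in data mirroring the hash built in the script above. With a plain `reverse`, the header kept for a duplicated sequence depends on hash ordering; assigning in sorted order with `//=` keeps the first header in sort order instead:

```perl
#! perl
use strict;
use warnings;

# Stand-in data (an assumption for this sketch): header => sequence,
# as built by the regex loop in the script above.
my %hdrs = (gi1 => 'ATG', gi2 => 'ATG', gi3 => 'GGG');

# Walk the headers in sorted order. //= assigns only when the sequence
# has no header yet, so the first header in sort order is the one kept.
my %seqs;
$seqs{ $hdrs{$_} } //= $_ for sort keys %hdrs;

print " C. Non-redundant sequences are:\n";
printf ">%s cds\n%s\n", $seqs{$_}, $_ for sort keys %seqs;
```

Note that `sort` compares as strings here, just as in the longer version, so a header such as gi10 would sort before gi2.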
When your code starts to get too complicated, it’s usually a good idea to step back and look for a simpler approach. Remember, less is more. ;-)
Hope that helps,
Athanasius <°(((>< contra mundum
Iustus alius egestas vitae, eros Piratica,
In reply to Re: How to get non-redundant DNA sequences from a FASTA file?
by Athanasius
in thread How to get non-redundant DNA sequences from a FASTA file?
by supriyoch_2008