in reply to How to get non-redundant DNA sequences from a FASTA file?

When you hear "unique", think "hash". In this case, you need to hash headers by sequences:
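#!/usr/bin/perl
use warnings;
use strict;

my $fasta = << '__FASTA__';
>gi1 cds
ATG fun
>gi2 cds
ATG fun
>gi3 cds
GGG fun
__FASTA__

# Split the input into records; only the first one keeps its leading '>'.
my @seq_with_hdr = split /\n>/, $fasta;
$seq_with_hdr[0] =~ s/^>//;

# Key the hash by sequence: a repeated sequence overwrites the earlier header.
my %hdr_by_seq;
for (@seq_with_hdr) {
    my ($hdr, $seq) = split /\n/;
    $hdr_by_seq{$seq} = $hdr;
}

# Hash keys come back in no particular order.
for my $seq (keys %hdr_by_seq) {
    print ">$hdr_by_seq{$seq}\n$seq\n"
}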

Note that whitespace in the data is significant. One of the "ATG fun" sequences had a trailing space, which made it a different hash key from the same sequence without the space. I removed the space in my code.
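If editing the data by hand is not practical, the whitespace can be normalised before the sequence is used as a hash key. The following is only a sketch under a few assumptions that are not in the original post: whitespace and letter case inside a sequence are never significant, a sequence may span several lines, and the first header seen for a sequence is the one to keep. The name dedupe_fasta and the sample input are made up for illustration.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper (not from the original post): returns the FASTA text
# with duplicate sequences removed, keeping the first header seen for each.
sub dedupe_fasta {
    my ($fasta) = @_;
    my (%seen, @out);
    for my $record (split /^>/m, $fasta) {
        next unless $record =~ /\S/;     # skip the empty chunk before the first '>'
        my ($hdr, @lines) = split /\n/, $record;
        my $seq = join '', @lines;       # join multi-line sequences
        $seq =~ s/\s+//g;                # assumption: whitespace is never significant
        next if $seen{ uc $seq }++;      # assumption: case is not significant either
        push @out, ">$hdr\n$seq\n";      # an array keeps the input order
    }
    return join '', @out;
}

# Illustrative input: gi2 repeats gi1 except for a trailing space on gi1.
print dedupe_fasta(">gi1\nATGC \n>gi2\nATGC\n>gi3\nGG\nGG\n");
```

Collecting the records in an array rather than printing from the hash keeps them in input order, and checking %seen before storing means the first header wins instead of the last.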

لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

Re^2: How to get non-redundant DNA sequences from a FASTA file?
by supriyoch_2008 (Monk) on Sep 13, 2014 at 12:48 UTC

    Hi Choroba,

    Thank you very much for fixing the problem and for your valuable suggestions regarding unique (array) and whitespace. I shall follow them. I searched Google for a way to fix this problem with Perl code, but I didn't find any such information. I did find a script written in Java, a language I do not know. The URL for the Java solution is http://seqanswers.com/forums/showthread.php?t=4442

    So, I wrote to PerlMonks for help.

    With regards