Re: How to get non-redundant DNA sequences from a FASTA file?

When you hear "unique", think "hash". In this case, you need to hash headers by sequences:

#!/usr/bin/perl
use warnings;
use strict;

my $fasta = << '__FASTA__';
>gi1 cds
ATG fun
>gi2 cds
ATG fun
>gi3 cds
GGG fun
__FASTA__

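# Split into records at each ">" that starts a new line,
# then strip the leading ">" left on the first record.
my @seq_with_hdr = split /\n>/, $fasta;
$seq_with_hdr[0] =~ s/^>//;

my %hdr_by_seq;

for (@seq_with_hdr) {
    # Assumes each sequence sits on a single line after its header.
    my ($hdr, $seq) = split /\n/;
    # Identical sequences share one key; the last header seen wins.
    $hdr_by_seq{$seq} = $hdr;
}

# Hash keys come back in no particular order.
for my $seq (keys %hdr_by_seq) {
    print ">$hdr_by_seq{$seq}\n$seq\n";
}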

Note that whitespace is not ignored in the data. There was a space after one of the "ATG fun" sequences, which made it differ from the same sequence without the trailing space. I removed the space in my code.
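
If you would rather not clean the data by hand, here is a minimal sketch of one way to do it, not the only one. It trims leading and trailing whitespace before hashing, joins sequences wrapped over several lines, and keeps the first header seen for each sequence, in input order. The sub name unique_records and the sample data are made up for illustration.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper: returns [header, sequence] pairs, one per
# distinct sequence, in order of first appearance.
sub unique_records {
    my ($fasta) = @_;
    my (%seen, @unique);
    for my $record (split /^>/m, $fasta) {
        next unless $record =~ /\S/;        # skip the empty chunk before the first ">"
        my ($hdr, @lines) = split /\n/, $record;
        s/^\s+|\s+$//g for $hdr, @lines;    # trim leading and trailing whitespace
        my $seq = join '', @lines;          # join wrapped sequence lines
        push @unique, [$hdr, $seq] unless $seen{$seq}++;
    }
    return @unique;
}

# Made-up sample: gi2 has a trailing space, gi3 is wrapped over two lines.
my $fasta = ">gi1 cds\nATG\n>gi2 cds\nATG \n>gi3 cds\nGG\nG\n";
print ">$_->[0]\n$_->[1]\n" for unique_records($fasta);
```

With this sample, gi2 is dropped as a duplicate of gi1 despite its trailing space, and gi3 is printed with its sequence joined to GGG. Whitespace inside a sequence line is left alone; only the ends are trimmed.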

لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ


Replies are listed 'Best First'.
Re^2: How to get non-redundant DNA sequences from a FASTA file?
by supriyoch_2008 (Monk) on Sep 13, 2014 at 12:48 UTC

Hi Choroba,

Thank you very much for fixing the problem and for your valuable suggestions regarding unique (array) and whitespace. I shall follow them. I searched Google for a way to fix this problem in Perl but did not find any such information. I did find a script based on a Java program, a language I do not know. The URL for the Java solution is http://seqanswers.com/forums/showthread.php?t=4442

So, I wrote to the Perl Monks for help.

With regards
