supriyoch_2008 has asked for the wisdom of the Perl Monks concerning the following question:
Hi Perlmonks
I am interested in getting the non-redundant DNA sequences from a FASTA file. Two sequences may have different headers but have the same DNA sequence. I want any one of these two sequences with its header in the output, not both. I have written a script (try.pl) but it does not give the results I want. I seek help from perl monks to fix this problem.
Here goes the script try.pl:
###################!/usr/bin/perl use warnings; $a=">gi1 cds ATG fun >gi2 cds ATG fun >gi3 cds GGG fun"; while ($a=~ m/(>.*?fun)/gs) { $b1=$&; $b2=$&; while ($b1=~ />.*?cds/gs) { $h=$&; $b2=~ s/$h//g; $b2=~ s/fun//g; $seq=$b2; $seq=~ s/\n//; $hdr_seq="$h\n"."$seq\n"; push @hdr_seq1,$hdr_seq; push @only_seq1,$seq; } } # To remove multiple copies (if any): my %seen; # declare a hash my @only_seq=(); my @hdr_seq=(); @only_seq=grep{!$seen{$_}++}@only_seq1; @hdr_seq=grep{!$seen{$_}++}@hdr_seq1; print "\n\n A. Header & sequences are:\n\n"; print join ("\n", @hdr_seq); print "\n"; print "\n B. Only sequences are:\n\n"; print join ("\n", @only_seq); print "\n\n"; $num=0; foreach my $item1 (@only_seq) {$num++; # No.1 curly $seq1=$item1; foreach my $item2 (@hdr_seq) { # No.2 curly if (defined $item2) { $item2=$item2; $item3=$item2;} while ($item2=~ m/>.*cds/gs) { $hdr2=$&; $item3=~ s/$hdr2//; $item3=~ s/\s//; $seq2=$item3; $ele2="$hdr2\n"."$seq2\n"; if ($seq1 eq $seq2) {push @result1,$ele2;} else {push @result1,$ele2;} } } # No.2 curly } # No.1 curly ###################################### my @result=(); @result=grep{!$seen{$_}++}@result1; print "\n C. Non-redundant sequences are:\n\n"; print join ("",@result); print "\n"; exit;
The results of the script go like:
Microsoft Windows [Version 6.1.7600] Copyright (c) 2009 Microsoft Corporation. All rights reserved. C:\Users\x\Desktop>try.pl A. Header & sequences are: >gi1 cds ATG >gi2 cds ATG >gi3 cds GGG B. Only sequences are: ATG GGG C. Non-redundant sequences are: (This is wrong) >gi1 cds ATG >gi2 cds ATG >gi3 cds GGG
Correct results for Non-redundant sequences should be like:
>gi1 cds ATG >gi3 cds GGG
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: How to get non-redundant DNA sequences from a FASTA file?
by choroba (Cardinal) on Sep 13, 2014 at 08:49 UTC | |
by supriyoch_2008 (Monk) on Sep 13, 2014 at 12:48 UTC | |
|
Re: How to get non-redundant DNA sequences from a FASTA file?
by Athanasius (Archbishop) on Sep 13, 2014 at 09:08 UTC | |
by supriyoch_2008 (Monk) on Sep 13, 2014 at 12:54 UTC | |
|
Re: How to get non-redundant DNA sequences from a FASTA file?
by biohisham (Priest) on Sep 15, 2014 at 05:48 UTC | |
by Anonymous Monk on Sep 06, 2015 at 12:14 UTC | |
by Not_a_Number (Prior) on Sep 06, 2015 at 12:32 UTC | |
by Anonymous Monk on Sep 07, 2015 at 11:08 UTC |