Hi Perlmonks

I am interested in getting the non-redundant DNA sequences from a FASTA file. Two sequences may have different headers but have the same DNA sequence. I want any one of these two sequences with its header in the output, not both. I have written a script (try.pl) but it does not give the results I want. I seek help from perl monks to fix this problem.

Here goes the script try.pl:

#!/usr/bin/perl use warnings; $a=">gi1 cds ATG fun >gi2 cds ATG fun >gi3 cds GGG fun"; while ($a=~ m/(>.*?fun)/gs) { $b1=$&; $b2=$&; while ($b1=~ />.*?cds/gs) { $h=$&; $b2=~ s/$h//g; $b2=~ s/fun//g; $seq=$b2; $seq=~ s/\n//; $hdr_seq="$h\n"."$seq\n"; push @hdr_seq1,$hdr_seq; push @only_seq1,$seq; } } # To remove multiple copies (if any): my %seen; # declare a hash my @only_seq=(); my @hdr_seq=(); @only_seq=grep{!$seen{$_}++}@only_seq1; @hdr_seq=grep{!$seen{$_}++}@hdr_seq1; print "\n\n A. Header & sequences are:\n\n"; print join ("\n", @hdr_seq); print "\n"; print "\n B. Only sequences are:\n\n"; print join ("\n", @only_seq); print "\n\n"; $num=0; foreach my $item1 (@only_seq) {$num++; # No.1 curly $seq1=$item1; foreach my $item2 (@hdr_seq) { # No.2 curly if (defined $item2) { $item2=$item2; $item3=$item2;} while ($item2=~ m/>.*cds/gs) { $hdr2=$&; $item3=~ s/$hdr2//; $item3=~ s/\s//; $seq2=$item3; $ele2="$hdr2\n"."$seq2\n"; if ($seq1 eq $seq2) {push @result1,$ele2;} else {push @result1,$ele2;} } } # No.2 curly } # No.1 curly ###################################### my @result=(); @result=grep{!$seen{$_}++}@result1; print "\n C. Non-redundant sequences are:\n\n"; print join ("",@result); print "\n"; exit;
##################

The results of the script go like:

Microsoft Windows [Version 6.1.7600] Copyright (c) 2009 Microsoft Corporation. All rights reserved. C:\Users\x\Desktop>try.pl A. Header & sequences are: >gi1 cds ATG >gi2 cds ATG >gi3 cds GGG B. Only sequences are: ATG GGG C. Non-redundant sequences are: (This is wrong) >gi1 cds ATG >gi2 cds ATG >gi3 cds GGG

Correct results for Non-redundant sequences should be like:

>gi1 cds ATG >gi3 cds GGG

In reply to How to get non-redundant DNA sequences from a FASTA file? by supriyoch_2008

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.