uvnew has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys. I've got a small problem that I couldn't solve by browsing Bioperl. I have a DNA sequence with an ID. I would like to delete 'DL;' from the ID, and to delete all '-' and '*' from the sequence. If for example this is my sequence and ID:
>DL;H1_ENSP00000194530_chr2_202024 CCCC---GCCTTCTCGCTGCCCAGC--CCCGGGGAGGGAGG*
Then I would like it to become:
>H1_ENSP00000194530_chr2_202024 CCCCGCCTTCTCGCTGCCCAGCCCCGGGGAGGGAGG
Each sequence is actually a few hundred letters long, but I assume that doesn't matter.

Thanks a lot for any idea!

Replies are listed 'Best First'.
Re: Substitution on a sequence
by ww (Archbishop) on Jan 29, 2007 at 15:51 UTC
    my $data =">DL;H1_ENSP00000194530_chr2_202024 CCCC---GCCTTCTCGCTGCCCAGC--CCCGGGGAGGGAGG*";
    $data =~ s/[-*]|DL;//g;   # globally, replace all "-" or "*" OR "DL;" with nothing
                              # ignores possibility "DL;" appears elsewhere in data
                              # if that's an issue, you might want to do two substitutions
                              # $data =~ s/[-*]//g; and $data =~ s/(>)DL;/$1/g;
                              # though that last is NOT tied to the beginning of the line
                              # which appears to be subject to brain_block at the moment
    print $data;

    =head OUTPUT
    perl dataclean.pl
    >H1_ENSP00000194530_chr2_202024 CCCCGCCTTCTCGCTGCCCAGCCCCGGGGAGGGAGG
    =cut
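
    One way to address the two caveats in the comments above (a "DL;" appearing elsewhere in the data, and the substitution not being tied to the start of the line) is to split the record into ID and sequence and clean each part separately. This is only a minimal sketch: the helper name clean_record is hypothetical, and it assumes each record is a single ">ID SEQUENCE" string with whitespace between the two parts, as in the OP's example.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): clean one ">ID SEQUENCE" record.
# Assumes the ID is the first whitespace-delimited token of the record.
sub clean_record {
    my ($record) = @_;
    my ($id, $seq) = split /\s+/, $record, 2;
    $seq = '' unless defined $seq;

    $id  =~ s/^>DL;/>/;   # anchored: only a leading ">DL;" is changed
    $seq =~ tr/-*//d;     # delete every "-" and "*" from the sequence only

    return length $seq ? "$id $seq" : $id;
}

print clean_record(
    ">DL;H1_ENSP00000194530_chr2_202024 CCCC---GCCTTCTCGCTGCCCAGC--CCCGGGGAGGGAGG*"
), "\n";
```

    Because the ID and the sequence are handled separately, a "DL;" inside the sequence or a "-" inside the ID would be left alone, which may or may not be what you want for your data.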
Re: Substitution on a sequence
by BrowserUk (Patriarch) on Jan 29, 2007 at 16:30 UTC

    Update: Credited the wrong person.

    I'd suggest a slight modification to ww's method. As this is a FASTA file, I'd read the file record by record (sequence by sequence), rather than line by line. The following one-liner ought to work, but is untested.

    perl -e"BEGIN{$/=qq[\n>]}" -wpe"s[[-*]|DL;][]g" theFile > theOutput
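
    For anyone who prefers a script to a one-liner, the record-by-record idea can be sketched roughly as below. This is not a drop-in equivalent of the one-liner: the helper name clean_records is hypothetical, the records are returned as a list instead of being printed with their separators intact, and the in-memory filehandle is there only so the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): read FASTA-style records
# one at a time by setting the input record separator to "\n>".
sub clean_records {
    my ($fh) = @_;
    local $/ = "\n>";            # each read ends just before the next ">"
    my @cleaned;
    while (my $record = <$fh>) {
        chomp $record;           # removes a trailing "\n>" separator, if present
        $record =~ s/\n+\z//;    # the last record ends in a plain newline
        $record =~ s/^>//;       # only the first record still carries its ">"
        next unless length $record;
        $record =~ s/^DL;//;     # drop the prefix at the start of the ID
        $record =~ s/[-*]//g;    # drop gap and stop characters (whole record)
        push @cleaned, ">$record";
    }
    return @cleaned;
}

my $input = ">DL;H1_a CC--GG*\n>DL;H2_b AA-TT*\n";
open my $fh, '<', \$input or die "cannot open in-memory file: $!";
print "$_\n" for clean_records($fh);
```

    Note that the character-class substitution here runs over the whole record, so it would also strip a "-" or "*" from the ID; that is harmless for the IDs shown in this thread but is an assumption about the data.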

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Tested (albeit non-rigorously) BrowserUk's one-liner with a data file (very modestly varied from the OP's), e.g.:

      >DL;H1_ENSP00000194530_chr2_202024 CCCC---GCCTTCTCGCTGCCCAGC--CCCGGGGAGGGAGG*
      ">DL;H2_ENSP00000194530_chr2_202024 CCCC---GCCTTCTCGCTGCCCAGC--CCCGGGGAGGGAGG*" line 2
      >DL;H1_ENSP00000194530_chr2_202024 CCCC---GCCTTCDLTCGCTGCCCAGC--CCCGGGGAGGGAGG* line3
      >DL;H1_ENSP00000194530_chr2_202024 CCCC---GCCTTCTCGCTGCCCAGC--CCCGGGGAGGGAGG line 4
      >DL;H1_ENSP00000194530_chr2_202024 CCCC---GCCTTCTCGCTGCCCAGC--CCCGG*GGAGGGAGG line 5

      and output is:

      >H1_ENSP00000194530_chr2_202024 CCCCGCCTTCTCGCTGCCCAGCCCCGGGGAGGGAGG
      ">H2_ENSP00000194530_chr2_202024 CCCCGCCTTCTCGCTGCCCAGCCCCGGGGAGGGAGG" line 2
      >H1_ENSP00000194530_chr2_202024 CCCCGCCTTCDLTCGCTGCCCAGCCCCGGGGAGGGAGG line3
      >H1_ENSP00000194530_chr2_202024 CCCCGCCTTCTCGCTGCCCAGCCCCGGGGAGGGAGG line 4
      >H1_ENSP00000194530_chr2_202024 CCCCGCCTTCTCGCTGCCCAGCCCCGGGGAGGGAGG line 5

      Nice, BrowserUK; ++

      Update: Fixed the mis-attribution. Give BrowserUK another ++ and I'll do penance in the dungeon; the more so, since it was he who answered a brain_dead question about his code.

Re: Substitution on a sequence
by davorg (Chancellor) on Jan 29, 2007 at 15:40 UTC
Re: Substitution on a sequence
by glasswalk3r (Friar) on Jan 29, 2007 at 17:39 UTC

    I believe you want this:

    DL;H1_ENSP00000194530_chr2_202024 CCCC---GCCTTCTCGCTGCCCAGC--CCCGGGGAGGGAGG*
    

    to become:

    H1_ENSP00000194530_chr2_202024 CCCCGCCTTCTCGCTGCCCAGCCCCGGGGAGGGAGG
    

    A very naive approach would be using tr and s to remove the undesired characters, like this:

    # in a loop while reading
    $_ =~ tr/-//d;
    $_ =~ s/^DL\;//o;
    $_ =~ s/\*$//o;

    Of course, I'm unaware of any rule for removing those characters, since you didn't mention anything about it.

    Alceu Rodrigues de Freitas Junior
    ---------------------------------
    "You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill