Substitution on a sequence

uvnew has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Substitution on a sequence by ww (Archbishop) on Jan 29, 2007 at 15:51 UTC
```
my $data = ">DL;H1_ENSP00000194530_chr2_202024 CCCC---GCCTTCTCGCTGCCCAGC--CCCGGGGAGGGAGG";

$data =~ s/[-]|DL;//g;   # globally replace every "-" OR "DL;" with nothing
                         # ignores the possibility that "DL;" appears elsewhere in the data
                         # if that's an issue, you might want to do two substitutions:
                         #   $data =~ s/[-]//g;  and  $data =~ s/(>)DL;/$1/g;
                         # though that last is NOT tied to the beginning of the line,
                         # which appears to be subject to brain_block at the moment
print $data;

=head1 OUTPUT

perl dataclean.pl
>H1_ENSP00000194530_chr2_202024 CCCCGCCTTCTCGCTGCCCAGCCCCGGGGAGGGAGG

=cut
```
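For the two-substitution variant ww mentions, one way to tie the "DL;" removal to the beginning of the line is the `^` anchor together with the `/m` modifier. This is only a minimal sketch: `clean_record` is a hypothetical helper name, and it assumes the tag always directly follows the ">" that opens a header.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): clean one FASTA-style record.
sub clean_record {
    my ($record) = @_;
    # Remove "DL;" only where it directly follows ">" at the start of a line;
    # /m lets ^ match after embedded newlines as well.
    $record =~ s/^>DL;/>/mg;
    # Delete every "-" character; tr///d needs no regex for this.
    $record =~ tr/-//d;
    return $record;
}

my $data = ">DL;H1_ENSP00000194530_chr2_202024 CCCC---GCCTTCTCGCTGCCCAGC--CCCGGGGAGGGAGG";
print clean_record($data), "\n";
```

With this form a "DL;" that happens to occur inside the sequence is left alone, which the single alternation `s/[-]|DL;//g` does not guarantee.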
Re: Substitution on a sequence by BrowserUk (Patriarch) on Jan 29, 2007 at 16:30 UTC
Update: Credited the wrong person.

I'd suggest a slight modification to ww's method. As this is a FASTA file, I'd read the file record by record (sequence by sequence), rather than line by line. The following one-liner ought to work, but is untested.

`perl -e"BEGIN{$/=qq[\n>]}" -wpe"s[[-*]|DL;][]g" theFile > theOutput`

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."
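The same record-by-record idea can be written out as a small script. This is a sketch under assumptions, not BrowserUk's code: the sample data and the `clean_fasta` name are made up for illustration, and an in-memory filehandle stands in for the real file.

```perl
use strict;
use warnings;

# Hypothetical sample input; real data would come from a file.
my $fasta = ">DL;H1_demo\nCCCC---GCC\nTTC--GG*\n>DL;H2_demo\nAA--CC\n";

sub clean_fasta {
    my ($text) = @_;
    # Open a read handle on the string so the example needs no file on disk.
    open my $in, '<', \$text or die "cannot open in-memory handle: $!";
    local $/ = "\n>";   # each read now returns one record, ending in "\n>"
    my $out = '';
    while (my $record = <$in>) {
        $record =~ s/[-*]|DL;//g;   # same pattern as the one-liner
        $out .= $record;
    }
    close $in;
    return $out;
}

print clean_fasta($fasta);
```

Because the separator "\n>" stays attached to the end of each record, concatenating the cleaned records restores the ">" that opens the next header.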
Re^2: Substitution on a sequence by ww (Archbishop) on Jan 29, 2007 at 16:55 UTC
Tested (albeit non-rigorously) BrowserUk's one-liner with a data file (very modestly varied from the OP's):

```
>DL;H1_ENSP00000194530_chr2_202024 CCCC---GCCTTCTCGCTGCCCAGC--CCCGGGGAGGGAGG*
">DL;H2_ENSP00000194530_chr2_202024 CCCC---GCCTTCTCGCTGCCCAGC--CCCGGGGAGGGAGG" line 2
>DL;H1_ENSP00000194530_chr2_202024 CCCC---GCCTTCDLTCGCTGCCCAGC--CCCGGGGAGGGAGG line3
>DL;H1_ENSP00000194530_chr2_202024 CCCC---GCCTTCTCGCTGCCCAGC--CCCGGGGAGGGAGG line 4
>DL;H1_ENSP00000194530_chr2_202024 CCCC---GCCTTCTCGCTGCCCAGC--CCCGGGGAGGGAGG line 5
```

and the output is:

```
>H1_ENSP00000194530_chr2_202024 CCCCGCCTTCTCGCTGCCCAGCCCCGGGGAGGGAGG
">H2_ENSP00000194530_chr2_202024 CCCCGCCTTCTCGCTGCCCAGCCCCGGGGAGGGAGG" line 2
>H1_ENSP00000194530_chr2_202024 CCCCGCCTTCDLTCGCTGCCCAGCCCCGGGGAGGGAGG line3
>H1_ENSP00000194530_chr2_202024 CCCCGCCTTCTCGCTGCCCAGCCCCGGGGAGGGAGG line 4
>H1_ENSP00000194530_chr2_202024 CCCCGCCTTCTCGCTGCCCAGCCCCGGGGAGGGAGG line 5
```

Nice, BrowserUk; ++

Update: Fixed the mis-attribution. Give BrowserUk another ++ and I'll do penance in the dungeon; the more so, since it was he who answered a brain_dead question about his code.
Re: Substitution on a sequence by davorg (Chancellor) on Jan 29, 2007 at 15:40 UTC
What bit are you having trouble with? What have you tried already? Can we see some (short) code?

-- <http://dave.org.uk>
"The first rule of Perl club is you do not talk about Perl club." -- Chip Salzenberg
Re: Substitution on a sequence by glasswalk3r (Friar) on Jan 29, 2007 at 17:39 UTC
I believe you want this:

DL;H1_ENSP00000194530_chr2_202024 CCCC---GCCTTCTCGCTGCCCAGC--CCCGGGGAGGGAGG*

to become:

H1_ENSP00000194530_chr2_202024 CCCCGCCTTCTCGCTGCCCAGCCCCGGGGAGGGAGG

A very naive approach would be to use `tr` and `s` to remove the undesired characters, like this:

```
# in a loop while reading
$_ =~ tr/-//d;       # delete every "-"
$_ =~ s/^DL\;//o;    # drop a leading "DL;"
$_ =~ s/\*$//o;      # drop a trailing "*"
```

Of course, I'm unaware of any rule for removing those characters, since you didn't mention one.

Alceu Rodrigues de Freitas Junior
---------------------------------
"You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill
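Spelled out as a complete loop, the line-by-line approach might look like the sketch below. It rests on assumptions the thread does not confirm: header lines start with ">" (as in the other replies), each header sits on its own line, and gap characters should only be removed from sequence lines. `clean_lines` is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: clean FASTA text one line at a time.
sub clean_lines {
    my ($text) = @_;
    my @out;
    for my $line (split /\n/, $text) {
        if ($line =~ /^>/) {
            $line =~ s/^>DL;/>/;   # header line: drop the "DL;" tag only
        }
        else {
            $line =~ tr/-//d;      # sequence line: delete "-" gap characters
            $line =~ s/\*$//;      # and a trailing "*", if present
        }
        push @out, $line;
    }
    return join("\n", @out) . "\n";
}

print clean_lines(">DL;H1_demo\nCCCC---GCC\nTTC--GG*\n");
```

Treating headers and sequence lines separately means a "-" inside an identifier survives, at the cost of depending on the layout assumption above.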