Removing Flanking "N"s in a DNA String

monkfan has asked for the wisdom of the Perl Monks concerning the following question:

Re: Removing Flanking "N"s in a DNA String by BrowserUk (Patriarch) on Nov 07, 2005 at 13:43 UTC
```
perl> $dna = "NNNNATCGNNNTCGANNN";;
perl> $dna =~ s[^N*(.*?)N*$][$1];;
perl> print $dna;;
ATCGNNNTCGA
```

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.
Re^2: Removing Flanking "N"s in a DNA String by reasonablekeith (Deacon) on Nov 07, 2005 at 15:24 UTC
With reference to your regex, I always find it a bit sadomasochistic to use regular expression meta characters to delimit a regular expression; it does kind of muddy the waters. Also, your regex is quite inefficient, as perl has to jump through hoops to save the value captured in the parentheses. It's quicker to "top and tail" the string over two lines. Three times quicker in fact...

```perl
#!/perl -w
use strict;
use Benchmark;

timethese(200000, {
    # one substitution that captures the core of the string
    'single' => sub {
        my $dna = "NNNNATCGNNNTCGANNN";
        $dna =~ s/^N*(.*?)N*$/$1/;
    },
    # two anchored substitutions, one per end
    'twin' => sub {
        my $dna = "NNNNATCGNNNTCGANNN";
        $dna =~ s/^N*//;
        $dna =~ s/N*$//;
    },
});
__OUTPUT__
Benchmark: timing 200000 iterations of single, twin...
    single:  3 wallclock secs ( 2.95 usr +  0.00 sys =  2.95 CPU) @ 67704.81/s (n=200000)
      twin:  1 wallclock secs ( 0.83 usr +  0.00 sys =  0.83 CPU) @ 240963.86/s (n=200000)
```

--- my name's not Keith, and I'm not reasonable.
Re^3: Removing Flanking "N"s in a DNA String by BrowserUk (Patriarch) on Nov 07, 2005 at 15:47 UTC
"With reference to your regex, I always find it a bit sadomasochistic to use regular expression meta characters to delimit a regular expression, it does kind of muddy the waters."

I prefer to use balanced delimiters for several reasons, not least of which is consistency.

See 506374, though I have to say the whole argument for using two statements rather than one seems like the most obvious case of premature optimisation, and yet it seems enshrined in the FAQ. If your benchmark is correct, then you're talking of saving 13.18 microseconds per trim, which means you'll save a whole second every 75,872 trims you perform.

I also think that your benchmark is flawed, but I'll wait until mine has been picked apart before I argue that case :)
Re^4: Removing Flanking "N"s in a DNA String by reasonablekeith (Deacon) on Nov 07, 2005 at 16:11 UTC
Re^3: Removing Flanking "N"s in a DNA String by Perl Mouse (Chaplain) on Nov 07, 2005 at 16:05 UTC
If speed is your game, it's better to write `s/^N+//; s/N+$//` instead of using `*`'s. If there are no leading (or trailing) N's, it's better for the regex to fail than to replace an empty string with an empty string.

`Perl --((8:>`
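A small sketch of the difference described here, assuming a string that has no flanking N's. `s///` returns the number of substitutions it made (or false), so the return value shows whether a replacement actually happened. The helper name `count_subs` is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: apply a "delete the match" substitution to a copy of
# the string and report how many substitutions s/// says it performed.
sub count_subs {
    my ($str, $re) = @_;
    my $n = ($str =~ s/$re//);   # count of substitutions, or false if no match
    return $n ? $n : 0;
}

my $clean = "ATCGNNNTCGA";       # no leading or trailing N's

# N* can match the empty string, so the substitution "succeeds"
# and replaces nothing with nothing.
print "star: ", count_subs($clean, qr/^N*/), "\n";   # star: 1

# N+ needs at least one N, so the match simply fails.
print "plus: ", count_subs($clean, qr/^N+/), "\n";   # plus: 0
```

Whether the failed match is measurably cheaper on real data is a separate question that only a benchmark can answer.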
Re^2: Removing Flanking "N"s in a DNA String by Aristotle (Chancellor) on Nov 07, 2005 at 15:21 UTC
Note that this is inefficient. The non-greedy quantifier will cause the engine to try matching the trailing `N*` part whenever it can, so in this case it will match into the middle `NNN` part before finding that the end-of-string doesn’t follow and backtracking out of it.

In very nearly every case where you want to do something at the start of the string and at the end of the string, you should use two anchored substitutions. And what’s more important is, that’s more readable too.

Makeshifts last the longest.
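One way to watch this behaviour is to embed a counter in the pattern with a `(?{ ... })` code block. This is only a sketch: the helper `lazy_trim_attempts` and the variable `$attempts` are invented for the illustration, and the exact count may depend on the Perl version and its regex optimisations.

```perl
use strict;
use warnings;

our $attempts;   # package variable so the embedded code block can see it

# Hypothetical helper: run the single-regex trim and count how often the
# engine gets past the lazy group and tries the trailing N*$ part.
sub lazy_trim_attempts {
    my ($str) = @_;
    $attempts = 0;
    my ($core) = $str =~ /^N*(.*?)(?{ $attempts++ })N*$/;
    return ($core, $attempts);
}

my ($core, $tries) = lazy_trim_attempts("NNNNATCGNNNTCGANNN");
print "core:  $core\n";    # core:  ATCGNNNTCGA
print "tries: $tries\n";   # more than one: the tail is retried as .*? grows
```

Each increment corresponds to the lazy group having grown by one character and the engine trying the tail again, which is the repeated work being described.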
Re^3: Removing Flanking "N"s in a DNA String by BrowserUk (Patriarch) on Nov 07, 2005 at 15:24 UTC
Really? Then what is wrong with my benchmark?

```perl
#! perl -slw
use strict;
use Benchmark qw[cmpthese];

my @tests = map {
    my $s = 'N' x $_;
    ( "${s}S${s}S", "${s}S${s}S${s}", "S${s}S${s}" )
} map { $_ * 10 } 1 .. 100;

@$_ = @tests for \ our( @a, @b, @c, @d );

cmpthese 10, {
    a => q[ s[^N*(.*?)N*$][$1] for @a ],
    b => q[ s[N*$][] and s[^n*][] for @b ],
    c => q[ s[N*$][] and s[^n*][] for @c ],
    d => q[ s[^N*(.*?)N*$][$1] for @d ],
};
__END__
P:\test>junk
    Rate    b    c    a    d
b 6.10/s   --  -1% -27% -28%
c 6.15/s   1%   -- -26% -27%
a 8.31/s  36%  35%   --  -1%
d 8.42/s  38%  37%   1%   --
```
Re^4: Removing Flanking "N"s in a DNA String by Aristotle (Chancellor) on Nov 07, 2005 at 16:10 UTC
Re^5: Removing Flanking "N"s in a DNA String by BrowserUk (Patriarch) on Nov 07, 2005 at 16:25 UTC
Re: Removing Flanking "N"s in a DNA String by inman (Curate) on Nov 07, 2005 at 13:43 UTC
A pair of substitutions, one anchored to the start of the string and one to the end, will help.

```perl
$dna = "NNNNATCGNNNTCGANNN";
$dna =~ s/^[N]+//;
$dna =~ s/[N]+$//;
print "$dna\n"
__END__
ATCGNNNTCGA
```
Re^2: Removing Flanking "N"s in a DNA String by ikegami (Patriarch) on Nov 07, 2005 at 13:52 UTC
No need for the square brackets:

```perl
$dna = "NNNNATCGNNNTCGANNN";
$dna =~ s/^N+//;
$dna =~ s/N+$//;
print "$dna\n"
__END__
ATCGNNNTCGA
```
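For reuse, the two anchored substitutions can be wrapped in a small function. This is a minimal sketch; the name `trim_flanking_n` is hypothetical, and it works on a copy so the caller's string is left alone.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of the sequence without flanking N's.
sub trim_flanking_n {
    my ($dna) = @_;
    $dna =~ s/^N+//;   # strip the leading run, if any
    $dna =~ s/N+$//;   # strip the trailing run, if any
    return $dna;
}

print trim_flanking_n("NNNNATCGNNNTCGANNN"), "\n";   # ATCGNNNTCGA
print trim_flanking_n("ATCG"), "\n";                 # ATCG (unchanged)
```

Internal runs of N are untouched, and a string made only of N's comes back empty.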
Re^2: Removing Flanking "N"s in a DNA String by davidrw (Prior) on Nov 07, 2005 at 15:17 UTC
As mentioned already, no need for the character class. Also (although I tend to prefer the two separate statements as well), this can be written as one: `$dna =~ s/(^N+|N+$)//g;`
Re^3: Removing Flanking "N"s in a DNA String by blazar (Canon) on Nov 08, 2005 at 08:52 UTC
Also, `s/(?:^N+|N+$)//g;` since we don't really need capturing. Which makes me think that ideally one may want capturing instead, to see the text that has been substituted. But then one would only get the N's at `^` (if any) as $1 or . Which in turn makes me wonder why there's this asymmetry between `s///` and `m//`, since the latter, in list context and with `/g`, does return all captures, while the former just returns a true (or false) value in any context: precisely, the number of matches (or the empty string)...
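The asymmetry is easy to see side by side. The sketch below runs a capturing `m//g` in list context first, then the non-capturing `s///g`, on the thread's sample string; the variable names are chosen only for the illustration.

```perl
use strict;
use warnings;

my $dna = "NNNNATCGNNNTCGANNN";

# m//g in list context returns every capture: both flanking runs.
my @flanks = $dna =~ /(^N+|N+$)/g;

# s///g returns only the number of substitutions it made.
my $count = ($dna =~ s/(?:^N+|N+$)//g);

print "flanks: @flanks\n";   # flanks: NNNN NNN
print "count:  $count\n";    # count:  2
print "dna:    $dna\n";      # dna:    ATCGNNNTCGA
```

So anyone who wants the removed text has to collect it with a separate match before substituting.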
Re^2: Removing Flanking "N"s in a DNA String by blazar (Canon) on Nov 07, 2005 at 14:49 UTC
Why do you need those character classes? `s/^N+//, s/N+$// for $dna;`
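The `for` modifier aliases `$_` to its operand, which is why both substitutions act on `$dna` without naming it. The same form extends to a whole list of sequences, as in this sketch (the sample data in `@seqs` is made up):

```perl
use strict;
use warnings;

my @seqs = ("NNNNATCGNNNTCGANNN", "NNACGT", "ACGTNN");   # made-up sample data

# "for" aliases $_ to each element in turn, so both substitutions
# edit the array elements in place.
s/^N+//, s/N+$// for @seqs;

print "$_\n" for @seqs;
# ATCGNNNTCGA
# ACGT
# ACGT
```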