monkfan has asked for the wisdom of the Perl Monks concerning the following question:

Fellow monks,
I want to remove the flanking Ns at both ends of this string.
$dna = "NNNNATCGNNNTCGANNN";
Such that it gives:
$noflank_end = "ATCGNNNTCGA";
Keeping the Ns that are flanked by ATCG bases.
Is there a simple and efficient way to do this?

Regards,
Edward

Re: Removing Flanking "N"s in a DNA String
by BrowserUk (Patriarch) on Nov 07, 2005 at 13:43 UTC

    perl> $dna = "NNNNATCGNNNTCGANNN";;
    perl> $dna =~ s[^N*(.*?)N*$][$1];;
    perl> print $dna;;
    ATCGNNNTCGA

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      With reference to your regex, I always find it a bit sadomasochistic to use regular expression meta characters to delimit a regular expression, it does kind of muddy the waters.

      Also, your regex is quite inefficient, as perl has to jump through hoops to save the value captured in the parentheses. It's quicker to "top and tail" the string over two lines. Three times quicker in fact...

      #!/perl -w
      use strict;
      use Benchmark;

      timethese(200000, {
          'single' => sub {
              my $dna = "NNNNATCGNNNTCGANNN";
              $dna =~ s/^N*(.*?)N*$/$1/;
          },
          'twin' => sub {
              my $dna = "NNNNATCGNNNTCGANNN";
              $dna =~ s/^N*//;
              $dna =~ s/N*$//;
          },
      });
      __OUTPUT__
      Benchmark: timing 200000 iterations of single, twin...
      single: 3 wallclock secs ( 2.95 usr + 0.00 sys = 2.95 CPU) @ 67704.81/s (n=200000)
      twin: 1 wallclock secs ( 0.83 usr + 0.00 sys = 0.83 CPU) @ 240963.86/s (n=200000)
      ---
      my name's not Keith, and I'm not reasonable.
        With reference to your regex, I always find it a bit sadomasochistic to use regular expression meta characters to delimit a regular expression, it does kind of muddy the waters.

        I prefer to use balanced delimiters for several reasons, not least of which is consistency.

        See 506374, though I have to say the whole argument for using two statements rather than one seems like the most obvious case of premature optimisation, and yet it seems enshrined in the FAQ.

        If your benchmark is correct, then you're talking of saving 13.18 microseconds per trim, which means you'll save a whole second every 75,872 trims you perform.

        I also think that your benchmark is flawed, but I'll wait until mine has been picked apart before I argue that case :)


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
        If speed is your game, it's better to write s/^N+//; s/N+$// instead of using *'s. If there are no leading (or trailing) N's, it's better for the regex to fail than to replace an empty string with an empty string.
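A minimal sketch of that difference, assuming a string with no flanking Ns at all; the variable names are illustrative only. With `*` the substitution always succeeds, even when all it does is replace an empty match, while with `+` it reports failure when there is nothing to strip.

```perl
use strict;
use warnings;

my $clean = "ATCGNNNTCGA";    # no flanking Ns at all

# With *, the pattern matches the empty string at position 0,
# so the substitution "succeeds" and returns 1.
my $star = ( my $copy1 = $clean ) =~ s/^N*//;

# With +, at least one N is required, so the match fails
# and s/// returns the empty string (false).
my $plus = ( my $copy2 = $clean ) =~ s/^N+//;

printf "star: %s, plus: %s\n",
    $star ? "matched" : "failed",
    $plus ? "matched" : "failed";
```

Either way the string itself is left unchanged; only the amount of work and the return value differ.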
        Perl --((8:>*

      Note that this is inefficient. The non-greedy quantifier will cause the engine to try matching the trailing N* part whenever it can, so in this case it will match into the middle NNN part before finding that the end-of-string doesn’t follow and backtracking out of it.

      In very nearly every case where you want to do something at the start of the string and at the end of the string, you should use two anchored substitutions.

      And what’s more important is, that’s more readable too.
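One rough way to watch the retries described above is to put an embedded code block between the lazy group and the trailing `N*$`. This is only a sketch: the `$tries` counter is an illustrative addition, `(?{ ... })` is a Perl feature the thread itself does not use, and the exact count may differ between perl versions.

```perl
use strict;
use warnings;

our $tries = 0;    # package variable so the embedded code block can see it

my $dna = "NNNNATCGNNNTCGANNN";

# (?{ ... }) runs each time the engine gets past the lazy (.*?) and is
# about to attempt the trailing N*$ part again.
$dna =~ s/^N*(.*?)(?{ $tries++ })N*$/$1/;

print "result: $dna\n";
print "attempts at the trailing part: $tries\n";
```

The counter grows with the length of the text between the flanks, which is the cost the two anchored substitutions avoid.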

      Makeshifts last the longest.

        Really, then what is wrong with my benchmark?

        #! perl -slw
        use strict;
        use Benchmark qw[cmpthese];

        my @tests = map{
            my $s = 'N' x $_;
            ( "${s}S${s}S", "${s}S${s}S${s}", "S${s}S${s}" )
        } map{ $_ * 10 } 1 .. 100;

        @$_ = @tests for \ our( @a, @b, @c, @d );

        cmpthese 10, {
            a=>q[ s[^N*(.*?)N*$][$1] for @a ],
            b=>q[ s[N*$][] and s[^n*][] for @b ],
            c=>q[ s[N*$][] and s[^n*][] for @c ],
            d=>q[ s[^N*(.*?)N*$][$1] for @d ],
        };

        __END__
        P:\test>junk
            Rate    b    c    a    d
        b 6.10/s   --  -1% -27% -28%
        c 6.15/s   1%   -- -26% -27%
        a 8.31/s  36%  35%   --  -1%
        d 8.42/s  38%  37%   1%   --
Re: Removing Flanking "N"s in a DNA String
by inman (Curate) on Nov 07, 2005 at 13:43 UTC
    A pair of substitutions with a regular expression anchored to the start and end of the string will help.
    $dna = "NNNNATCGNNNTCGANNN";
    $dna =~ s/^[N]+//;
    $dna =~ s/[N]+$//;
    print "$dna\n"
    __END__
    ATCGNNNTCGA
      No need for the square brackets:
      $dna = "NNNNATCGNNNTCGANNN";
      $dna =~ s/^N+//;
      $dna =~ s/N+$//;
      print "$dna\n"
      __END__
      ATCGNNNTCGA
      As mentioned already, no need for the character class. Also (although I tend to prefer the two separate statements as well), this can be written as one:
      $dna =~ s/(^N+|N+$)//g;
        Also,
        s/(?:^N+|N+$)//g;
        since we don't really need capturing. Which makes me think that ideally one may want it, instead, to see the text that has been substituted. But then he would only get the N's at ^ (if any) as $1 or . Which in turn makes me wonder why there's this asymmetry between s/// and m//, since the latter, in list context and with /g, does return all captures, while the former just returns a true (or false) value in any context, precisely the number of matches (or the empty string)...
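The asymmetry can be seen side by side in a short sketch using the sample string from the question. The `/e` workaround at the end is an assumption of mine rather than something proposed in the thread, and the variable names are illustrative.

```perl
use strict;
use warnings;

my $dna = "NNNNATCGNNNTCGANNN";

# m//g in list context returns every captured substring.
my @flanks = $dna =~ /(^N+|N+$)/g;

# s///g returns only the number of substitutions it made.
my $trimmed = $dna;
my $count   = ( $trimmed =~ s/(?:^N+|N+$)//g );

# One workaround (an assumption, not from the thread): collect the
# removed text yourself with /e while replacing it with nothing.
my @removed;
( my $copy = $dna ) =~ s/(^N+|N+$)/push @removed, $1; ""/ge;

print "captures: @flanks\n";
print "count: $count, trimmed: $trimmed\n";
print "removed: @removed\n";
```

Here the capturing group is kept in the `/e` version precisely because the removed text is wanted; the non-capturing form stays the better choice when it is not.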
      Why do you need those character classes?
      s/^N+//, s/N+$// for $dna;