in reply to Re: Removing Flanking "N"s in a DNA String
in thread Removing Flanking "N"s in a DNA String

With reference to your regex, I always find it a bit sadomasochistic to use regular expression metacharacters to delimit a regular expression; it does kind of muddy the waters.

Also, your regex is quite inefficient, as perl has to jump through hoops to save the value captured in the parentheses. It's quicker to "top and tail" the string with two substitutions. Three times quicker, in fact...

#!/perl -w
use strict;
use Benchmark;

timethese(200000, {
    'single' => sub {
        my $dna = "NNNNATCGNNNTCGANNN";
        $dna =~ s/^N*(.*?)N*$/$1/;
    },
    'twin' => sub {
        my $dna = "NNNNATCGNNNTCGANNN";
        $dna =~ s/^N*//;
        $dna =~ s/N*$//;
    },
});

__OUTPUT__
Benchmark: timing 200000 iterations of single, twin...
    single:  3 wallclock secs ( 2.95 usr +  0.00 sys =  2.95 CPU) @ 67704.81/s (n=200000)
      twin:  1 wallclock secs ( 0.83 usr +  0.00 sys =  0.83 CPU) @ 240963.86/s (n=200000)
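Before comparing speeds, it's worth confirming that the two approaches actually return the same string. A minimal sketch; the sub names trim_single and trim_twin are made up here for illustration and are not part of the benchmark above:

```perl
use strict;
use warnings;

# Hypothetical helper: one substitution that captures the middle of the string.
sub trim_single {
    my ($dna) = @_;
    $dna =~ s/^N*(.*?)N*$/$1/;
    return $dna;
}

# Hypothetical helper: "top and tail" with two substitutions and no capture.
sub trim_twin {
    my ($dna) = @_;
    $dna =~ s/^N*//;
    $dna =~ s/N*$//;
    return $dna;
}

my $sample = "NNNNATCGNNNTCGANNN";
print trim_single($sample), "\n";
print trim_twin($sample), "\n";
```

Both calls should leave the internal run of N's alone and strip only the flanking ones.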
---
my name's not Keith, and I'm not reasonable.

Re^3: Removing Flanking "N"s in a DNA String
by BrowserUk (Patriarch) on Nov 07, 2005 at 15:47 UTC
    With reference to your regex, I always find it a bit sadomasochistic to use regular expression metacharacters to delimit a regular expression; it does kind of muddy the waters.

    I prefer to use balanced delimiters for several reasons, not least of which is consistency.

    See 506374, though I have to say the whole argument for using two statements rather than one seems like the most obvious case of premature optimisation--and yet it seems enshrined in the FAQ.

    If your benchmark is correct, then you're talking of saving 13.18 microseconds per trim, which means you'll save a whole second every 75,872 trims you perform.
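The second figure is just the reciprocal of the first. A quick check, taking the 13.18 microseconds quoted above as given (the variable names are illustrative only):

```perl
use strict;
use warnings;

my $saving_us = 13.18;    # quoted saving per trim, in microseconds

# Number of whole trims needed before the savings add up to one second.
my $trims_per_second_saved = int(1_000_000 / $saving_us);

print "$trims_per_second_saved trims to save one second\n";
```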

    I also think that your benchmark is flawed, but I'll wait until mine has been picked apart before I argue that case :)


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I prefer to use balanced delimiters for several reasons, not least of which is consistency.
      I'm with you on that one, but you could use curly brackets. That way your regexes wouldn't look like big character classes.
      two statements rather than one seems like the most obvious case of premature optimisation
      The OP specifically asked for an efficient way of doing this. I saw your capture and thought it was unnecessary.
      I also think that your benchmark is flawed
      Well I didn't type out the results by hand :P

      Having said that, I've no idea how they would be affected by different data, but only the OP can test that properly, as we don't have a sample of the actual data to be processed.
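To illustrate the delimiter point, here is a small sketch assuming the regex under discussion was written with square-bracket delimiters (the variable names are made up for the example):

```perl
use strict;
use warnings;

my $dna = "NNNNATCGNNNTCGANNN";

# Square brackets are legal balanced delimiters, but the pattern
# can be mistaken for a character class at a glance.
(my $with_square = $dna) =~ s[^N+][];

# Curly brackets are balanced too, and read less like part of the pattern.
(my $with_curly = $dna) =~ s{^N+}{};

print "$with_square\n$with_curly\n";
```

Both substitutions do the same thing; only the readability differs.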

      ---
      my name's not Keith, and I'm not reasonable.
Re^3: Removing Flanking "N"s in a DNA String
by Perl Mouse (Chaplain) on Nov 07, 2005 at 16:05 UTC
    If speed is your game, it's better to write s/^N+//; s/N+$// instead of using *'s. If there are no leading (or trailing) N's, it's better for the regex to fail than to replace an empty string with an empty string.
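A minimal sketch of that difference, with a hypothetical helper name trim_plus. s/// returns a true value only when it actually substitutes something, so the * form always "succeeds" while the + form fails when there is nothing to strip:

```perl
use strict;
use warnings;

# Hypothetical helper: trim flanking N's using + so that a string
# without flanking N's is left untouched by a failed match.
sub trim_plus {
    my ($dna) = @_;
    $dna =~ s/^N+//;    # no substitution when there is no leading N
    $dna =~ s/N+$//;    # no substitution when there is no trailing N
    return $dna;
}

# Compare what s/// reports on a string with no leading N's.
my $clean = "ATCG";
my $star_matched = ( (my $copy_star = $clean) =~ s/^N*// ) ? 1 : 0;
my $plus_matched = ( (my $copy_plus = $clean) =~ s/^N+// ) ? 1 : 0;

print trim_plus("NNNNATCGNNNTCGANNN"), "\n";
print "star substituted: $star_matched, plus substituted: $plus_matched\n";
```

Whether the failed match is measurably faster will depend on the data, so it is worth benchmarking on real input.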
    Perl --((8:>*