in reply to Removing Flanking "N"s in a DNA String

perl> $dna = "NNNNATCGNNNTCGANNN";;
perl> $dna =~ s[^N*(.*?)N*$][$1];;
perl> print $dna;;
ATCGNNNTCGA

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: Removing Flanking "N"s in a DNA String
by reasonablekeith (Deacon) on Nov 07, 2005 at 15:24 UTC
    With reference to your regex, I always find it a bit sadomasochistic to use regular expression metacharacters to delimit a regular expression; it does kind of muddy the waters.

    Also, your regex is quite inefficient, as Perl has to jump through hoops to save the value captured in the parentheses. It's quicker to "top and tail" the string over two lines. Three times quicker, in fact...

    #!/perl -w
    use strict;
    use Benchmark;

    timethese(200000, {
        'single' => sub {
            my $dna = "NNNNATCGNNNTCGANNN";
            $dna =~ s/^N*(.*?)N*$/$1/;
        },
        'twin' => sub {
            my $dna = "NNNNATCGNNNTCGANNN";
            $dna =~ s/^N*//;
            $dna =~ s/N*$//;
        },
    });
    __OUTPUT__
    Benchmark: timing 200000 iterations of single, twin...
        single:  3 wallclock secs ( 2.95 usr +  0.00 sys =  2.95 CPU) @ 67704.81/s (n=200000)
          twin:  1 wallclock secs ( 0.83 usr +  0.00 sys =  0.83 CPU) @ 240963.86/s (n=200000)
    ---
    my name's not Keith, and I'm not reasonable.
      With reference to your regex, I always find it a bit sadomasochistic to use regular expression metacharacters to delimit a regular expression; it does kind of muddy the waters.

      I prefer to use balanced delimiters for several reasons, not least of which is consistency.

      See 506374, though I have to say the whole argument for using two statements rather than one seems like the most obvious case of premature optimisation--and yet it seems enshrined in the FAQ.

      If your benchmark is correct, then you're talking of saving roughly 10.6 microseconds per trim (14.77 µs for the single-regex version versus 4.15 µs for the two-statement version), which means you'll save a whole second for every ~94,000 trims you perform.

      I also think that your benchmark is flawed, but I'll wait until mine has been picked apart before I argue that case :)


        I prefer to use balanced delimiters for several reasons, not least of which is consistency.
        I'm with you on that one, but you could use curly brackets. That way your regexes wouldn't look like big character classes.
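        For comparison, a minimal sketch (not from the post) of the two delimiter styles; they behave identically, only the readability differs:

        my $dna = "NNNNATCGNNNTCGANNN";
        $dna =~ s[^N*][];    # square brackets: easy to misread as a character class
        $dna =~ s{N*$}{};    # curly brackets: unambiguously delimiters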
        two statements rather than one seems like the most obvious case of premature optimisation
        The OP specifically asked for an efficient way of doing this. I saw your capture and thought it was unnecessary.
        I also think that your benchmark is flawed
        Well I didn't type out the results by hand :P

        Having said that, I've no idea how the results would be affected by different data; only the OP can test that properly, as we don't have a sample of the actual data to be processed.

        ---
        my name's not Keith, and I'm not reasonable.
      If speed is your game, it's better to write s/^N+//; s/N+$// instead of using *'s. If there are no leading (or trailing) N's, it's better for the regex to fail than to replace an empty string with an empty string.
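      A minimal sketch (not from the post) of how one might measure that, using the same Benchmark module as above; the worst case for the * variants is a string with no flanking Ns at all:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Benchmark qw( cmpthese );

      # No flanking Ns: s/^N*// and s/N*$// still "succeed" by replacing
      # an empty string with an empty string; the N+ forms simply fail.
      my $dna = "ATCGNNNTCGA";

      cmpthese -2 => {
          star => sub { my $s = $dna; $s =~ s/^N*//; $s =~ s/N*$//; },
          plus => sub { my $s = $dna; $s =~ s/^N+//; $s =~ s/N+$//; },
      };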
      Perl --((8:>*
Re^2: Removing Flanking "N"s in a DNA String
by Aristotle (Chancellor) on Nov 07, 2005 at 15:21 UTC

    Note that this is inefficient. The non-greedy quantifier will cause the engine to try matching the trailing N* part at every position it reaches, so in this case it will match into the middle NNN run before finding that the end of the string doesn’t follow, and backtracking out of it.

    In very nearly every case where you want to do something at the start of the string and at the end of the string, you should use two anchored substitutions.

    And what’s more important: it’s more readable too.
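    For concreteness, a minimal sketch of that idiom (using the + quantifier suggested elsewhere in this thread):

    my $dna = "NNNNATCGNNNTCGANNN";
    $dna =~ s/^N+//;    # trim leading Ns, anchored at the start
    $dna =~ s/N+$//;    # trim trailing Ns, anchored at the end
    print $dna;         # prints: ATCGNNNTCGA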

    Makeshifts last the longest.

      Really, then what is wrong with my benchmark?

      #! perl -slw
      use strict;
      use Benchmark qw[ cmpthese ];

      my @tests = map {
          my $s = 'N' x $_;
          ( "${s}S${s}S", "${s}S${s}S${s}", "S${s}S${s}" )
      } map { $_ * 10 } 1 .. 100;

      @$_ = @tests for \our( @a, @b, @c, @d );

      cmpthese 10, {
          a => q[ s[^N*(.*?)N*$][$1] for @a ],
          b => q[ s[N*$][] and s[^n*][] for @b ],    # NB: the leading pattern uses a lowercase 'n'
          c => q[ s[N*$][] and s[^n*][] for @c ],
          d => q[ s[^N*(.*?)N*$][$1] for @d ],
      };

      __END__
      P:\test>junk
             Rate    b    c    a    d
      b    6.10/s   --  -1% -27% -28%
      c    6.15/s   1%   -- -26% -27%
      a    8.31/s  36%  35%   --  -1%
      d    8.42/s  38%  37%   1%   --

        How about “apples and oranges” or “it’s a really worthless benchmark”?

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Benchmark qw( cmpthese );

        sub run_tests {
            my ( $len_remove, $len_keep, $num_repeat ) = @_;
            my $remove = 'N' x $len_remove;
            my $keep   = 'O' x $len_keep;
            my %test = (
                front     => "$remove$keep" x $num_repeat,
                tail      => "$keep$remove" x $num_repeat,
                both_ends => "$remove$keep" x $num_repeat . $remove,
                nothing   => "$keep$remove" x $num_repeat . $keep,
            );

            print "$len_remove chars to remove, $len_keep chars long kept sequences, $num_repeat repetitions.\n";

            for my $type ( keys %test ) {
                print "Measuring removing at $type.\n";
                cmpthese -2 => {
                    one_sub => sub { for( 1 .. 1000 ) { s{^N*(.*?)N*$}{$1} for my $copy = $test{$type} } },
                    two_sub => sub { for( 1 .. 1000 ) { s{^N*}{}, s{N*$}{} for my $copy = $test{$type} } },
                };
            }

            print "\n";
        }

        $|++;

        run_tests 4, 4, 1;
        run_tests 20, 20, 1;
        run_tests 20, 20, 50;
        run_tests 4, 4, 20;
        run_tests 4, 12, 10;
        run_tests 4, 100, 100;

        This gives me:

        As you can see, the two-subst version is always faster. If you don’t believe me, run the thing through use re 'debug'; and watch what the engine is doing.
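        For instance, a minimal sketch of that (not from the post), applied to the one-regex version from the top of the thread:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use re 'debug';    # dumps the engine's compile and match steps to STDERR

        my $dna = "NNNNATCGNNNTCGANNN";
        $dna =~ s/^N*(.*?)N*$/$1/;    # watch the engine repeatedly attempt the trailing N*$ as .*? grows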

        Makeshifts last the longest.