in reply to Re: Removing Flanking "N"s in a DNA String
in thread Removing Flanking "N"s in a DNA String

Note that this is inefficient. The non-greedy quantifier will cause the engine to try matching the trailing N* part whenever it can, so in the case it will match into the middle NNNN part before finding that the end-of-string doesn’t follow and backtracking out of it.

In very nearly every case where you want to do something at the start of the string and at the end of the string, you should use two anchored substituions.

And what’s more important is, that’s more readable too.

Makeshifts last the longest.

  • Comment on Re^2: Removing Flanking "N"s in a DNA String

Replies are listed 'Best First'.
Re^3: Removing Flanking "N"s in a DNA String
by BrowserUk (Patriarch) on Nov 07, 2005 at 15:24 UTC

    Really, then what is wrong with my benchmark?

    #! perl -slw use strict; use Benchmark qw[cmpthese]; my @tests = map{ my $s = 'N' x $_; ( "${s}S${s}S", "${s}S${s}S${s}", "S${s}S${s}" ) } map{ $_ * 10 } 1 .. 100; @$_ = @tests for \ our( @a, @b, @c, @d ); cmpthese 10, { a=>q[ s[^N*(.*?)N*$][$1] for @a ], b=>q[ s[N*$][] and s[^n*][] for @b ], c=>q[ s[N*$][] and s[^n*][] for @c ], d=>q[ s[^N*(.*?)N*$][$1] for @d ], }; __END__ P:\test>junk Rate b c a d b 6.10/s -- -1% -27% -28% c 6.15/s 1% -- -26% -27% a 8.31/s 36% 35% -- -1% d 8.42/s 38% 37% 1% --

      How about “apples and oranges” or “it’s a really worthless benchmark”?

      #!/usr/bin/perl use strict; use warnings; use Benchmark qw( cmpthese ); sub run_tests { my ( $len_remove, $len_keep, $num_repeat ) = @_; my $remove = 'N' x $len_remove; my $keep = 'O' x $len_keep; my %test = ( front => "$remove$keep" x $num_repeat, tail => "$keep$remove" x $num_repeat, both_ends => "$remove$keep" x $num_repeat . $remove, nothing => "$keep$remove" x $num_repeat . $keep, ); print "$len_remove chars to remove, $len_keep chars long kept sequ +ences, $num_repeat repetitions.\n"; for my $type ( keys %test ) { print "Measuring removing at $type.\n"; cmpthese -2 => { one_sub => sub { for( 1 .. 1000 ) { s{^N*(.*?)N*$}{$1} for + my $copy = $test{$type} } }, two_sub => sub { for( 1 .. 1000 ) { s{^N*}{}, s{N*$}{} for + my $copy = $test{$type} } }, }; } print "\n"; } $|++; run_tests 4, 4, 1; run_tests 20, 20, 1; run_tests 20, 20, 50; run_tests 4, 4, 20; run_tests 4, 12, 10; run_tests 4, 100, 100;

      This gives me:

      As you can see, the two-subst version is always faster. If you don’t believe me, run the thing through use re 'debug'; and watch what the engine is doing.

      Makeshifts last the longest.

        There are two problems with your benchmark.

        1. The time taken to copy the data to modify within the code being benchmarked, drowns the minscule time spent do so.
        2. Your benchmark does not account for Benchmark.pms habit of preferencial biasing the tests in favour of the first case run.

          In the following, the only change I have made to your benchmark is to reverse the naming of the cases so tha they will be run in teh reverse order. Note how in most cases the "winner" is reversed, and in the few where this not the case, the differences are within the bounds of experimental error:


          Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
          Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
          "Science is about questioning the status quo. Questioning authority".
          In the absence of evidence, opinion is indistinguishable from prejudice.