Re: Regex related question

Several people have commented in followups as well as messages to me that the various approaches this thread accumulated ought to be benchmarked. I already commented that I would probably be disinclined to care about how fast the various approaches are, given there's nothing particularly exciting going on (we're not dealing with traversing data structures, sorting, and so on). But given the context of what looks like DNA-munging, I do know that such tasks have a tendency to be brute-force expensive.

Of course without the context of seeing our little snippets of code within a greater application we have no way of knowing how computationally important our code suggestions will be. Profiling is the only way to know if any of this is worthwhile or not. With that in mind I decided to have a little fun with Benchmark. And lately I've been trying to also take on opportunities to put Test::More through its paces too. So here is the wall of code that is a benchmark of all of the alternatives posted to this thread:

use strict;
use warnings;
use v5.12;
use Benchmark qw/cmpthese/;
use Test::More tests => 32;

my $compare_iterations = 75000; # for Benchmark

my @data = qw/
    ACTGCTAGGGGGGG
    TCAGCTAGCNA
    ACTGSCGACAAAA
    GTCTGAGTTATTT
/;

my %answer_key;
@answer_key{ @data } = qw/
    ACTGCTAGG
    TCAGCTAGCNA
    ACTGSCGACAA
    GTCTGAGTTATT
/;

# Names of subs to be tested and benchmarked
my @subnames = qw/
    perlbotics_re
    perlbotics_c
    happybarney_re
    davido_c
    marshall_c
    javafan_re
/;

note "Test whether all benchmark subs exist.\n\n";
can_ok( 'main', @subnames );

note "\nTest whether all benchmark subs return expected values.\n\n";
foreach my $subname ( @subnames ) {
    foreach my $line ( @data ) {
        no strict 'refs';
        cmp_ok( 
            $subname->($line), # Warning: Symbolic ref.
            'eq', 
            $answer_key{$line}, 
            "$subname($line)"
        );
    }
    note "\n";
}

# Test that our Benchmark sub wrapper hands us good subrefs.
{
    my @subrefs = wrap_subs( \@subnames, \@data );

    note "Does wrap_subs() work?\n\n";
    
    like( 
        $subrefs[0],
        qr/CODE/,
        "wrap_subs() returns coderefs"
    );
    
    note "\n";
    
    foreach my $idx ( 0 .. $#subnames ) {
        is_deeply( 
            [ $subrefs[$idx]->() ],
            [ @answer_key{@data} ],
            "$subnames[$idx]() behaves ok as a wrapped subref."
        );
    }

}

note "\nDone testing.\n\n";
# End testing.

# Start benchmarking.

note "Benchmarking.\n\n";

my $bench_hash;
@{$bench_hash}{ @subnames } = wrap_subs( \@subnames, \@data );

# We did all this work for one line, here:
cmpthese( $compare_iterations, $bench_hash );


note "\nEnd of benchmark.\n";

# Wrapper manufacturer for Benchmark.  Accepts an arrayref
# of subnames and an arrayref of test data.  Returns a list
# of subrefs for our test subs, each wrapped in a calling 
# environment that passes in the data set and iterates over it.
sub wrap_subs {
    my ( $subnames, $data ) = @_;
    my @test_subs;
    foreach my $subname ( @{$subnames} ) {
        no strict 'refs';
        push( 
            @test_subs, 
            sub {
                my @results;
                foreach my $line ( @{$data} ) {
                    # Warning: symbolic ref in $subname.
                    push @results, $subname->($line);
                }
                return @results;
            }
        );
    }
    return @test_subs;
}


# These are the subs we're benchmarking.
# Naming convention: author_style, where 're' is a
# regexp approach, and 'c' is a functional approach.

sub perlbotics_re {
    my $line = shift;
    $line =~ s/(\w)\1{2,}\Z/$1$1/;
    return $line;
}

sub perlbotics_c {
    my $line = shift;
    my $last = substr $line, -1;
    if( length $line > 2 and substr( $line, -2, 1 ) eq $last ) {
        chop $line while substr( $line, -1 ) eq $last;
        $line .= $last . $last;
    }
    return $line;
}

sub happybarney_re {
    my $line = shift;
    $line =~ s/(\w)\1*(?=\1\1\Z)//;
    return $line;
}

sub davido_c {
    my $line = shift;
    my $pos = length( $line ) - 1;
    my $find = substr( $line, $pos, 1 );
    $pos-- while substr( $line, $pos, 1 ) eq $find;
    substr( $line, $pos + 3, length( $line ) - ( $pos + 3 ), '' )
        if length( $line ) - $pos > 3;
    return $line
}

sub marshall_c {
    my $line = shift;
    my $last_char = substr( $line, -1, 1 );
    my $cur_index = -1;
    while( substr( $line, --$cur_index, 1 ) eq $last_char) {}
    substr( $line, $cur_index+1, -$cur_index-3, '') if $cur_index < 3;
    return $line;
}

sub javafan_re {
    my $line = shift;
    $line =~ s/(.)\1\K(\1+)$//;
    return $line;
}
[download]

On my system that produces the following mass of output (along with the benchmark results):

1..32
# Test whether all benchmark subs exist.
# 
ok 1 - main->can(...)
# 
# Test whether all benchmark subs return expected values.
# 
ok 2 - perlbotics_re(ACTGCTAGGGGGGG)
ok 3 - perlbotics_re(TCAGCTAGCNA)
ok 4 - perlbotics_re(ACTGSCGACAAAA)
ok 5 - perlbotics_re(GTCTGAGTTATTT)
# 
ok 6 - perlbotics_c(ACTGCTAGGGGGGG)
ok 7 - perlbotics_c(TCAGCTAGCNA)
ok 8 - perlbotics_c(ACTGSCGACAAAA)
ok 9 - perlbotics_c(GTCTGAGTTATTT)
# 
ok 10 - happybarney_re(ACTGCTAGGGGGGG)
ok 11 - happybarney_re(TCAGCTAGCNA)
ok 12 - happybarney_re(ACTGSCGACAAAA)
ok 13 - happybarney_re(GTCTGAGTTATTT)
# 
ok 14 - davido_c(ACTGCTAGGGGGGG)
ok 15 - davido_c(TCAGCTAGCNA)
ok 16 - davido_c(ACTGSCGACAAAA)
ok 17 - davido_c(GTCTGAGTTATTT)
# 
ok 18 - marshall_c(ACTGCTAGGGGGGG)
ok 19 - marshall_c(TCAGCTAGCNA)
ok 20 - marshall_c(ACTGSCGACAAAA)
ok 21 - marshall_c(GTCTGAGTTATTT)
# 
ok 22 - javafan_re(ACTGCTAGGGGGGG)
ok 23 - javafan_re(TCAGCTAGCNA)
ok 24 - javafan_re(ACTGSCGACAAAA)
ok 25 - javafan_re(GTCTGAGTTATTT)
# 
# Does wrap_subs() work?
# 
ok 26 - wrap_subs() returns coderefs
# 
ok 27 - perlbotics_re() behaves ok as a wrapped subref.
ok 28 - perlbotics_c() behaves ok as a wrapped subref.
ok 29 - happybarney_re() behaves ok as a wrapped subref.
ok 30 - davido_c() behaves ok as a wrapped subref.
ok 31 - marshall_c() behaves ok as a wrapped subref.
ok 32 - javafan_re() behaves ok as a wrapped subref.
# 
# Done testing.
# 
# Benchmarking.
# 
                  Rate perlbotics_re happybarney_re javafan_re davido_c perlbotics_c marshall_c
perlbotics_re  26709/s            --           -17%       -27%     -51%         -54%       -58%
happybarney_re 32258/s           21%             --       -12%     -41%         -44%       -50%
javafan_re     36711/s           37%            14%         --     -33%         -37%       -43%
davido_c       54665/s          105%            69%        49%       --          -6%       -15%
perlbotics_c   57915/s          117%            80%        58%       6%           --       -10%
marshall_c     64157/s          140%            99%        75%      17%          11%         --
# 
# End of benchmark.

It comes as no real surprise that the regular expression-based solutions produced the least amount of code. And frankly I think they also produced the most readable code. Without walking through the details of each regexp, at least we can look at them and know immediately there's some substitution going on at the end of the string.

It also comes as little surprise that the substr solutions were all faster.

To put it all in context though, none of these is an order of magnitude faster than the other. Sure, the best substr method is more than twice as fast as the worst regexp approach. But where benchmarking really gets interesting (in my opinion) is when it leads to algorithms that move up the Big-O ladder toward O(1).

The moral: Profile. And until profiling points out that the clearest solution is contributing to a bottleneck, don't worry. If it is contributing to a bottleneck, break out the old fashioned substr tools.

Oh, I did want to mention: I've been playing with abstracting the creation of Benchmarks, and that's why this code ends up with a lot of "higher order" craziness in generating sub wrappers. I hope that part is actually the most interesting aspect of this post. And I hope it encourages people to look into testing if they haven't already. It actually made this code easier to keep from getting unruly.

Dave

Comment on Re: Regex related question Download Code