Sweating the small stuff: a lesson in optimisation

A lesson for me, that is, which I thought I'd share as a warning to others. :)

I needed to read and write data in a legacy format that, for reasons that may have seemed compelling in the 1970s, stores data "efficiently": data is 7-bit ASCII, and rather than waste a whole bit of every octet, 8 chars are packed into every 7 bytes.

pack can't handle this directly: it only does whole bytes. vec only does powers-of-2. If Bit::Vector can do it then I sure can't figure out how (but the documentation for that module makes my head explode at the best of times).

Fortunately, however, this type of encoding is still widely used in one everyday field: SMS. I know there are some SMS modules on CPAN. A quick search throws up Device::Gsm::Pdu and GSM::SMS::PDU. I look inside.

I didn't much like the look of either, to be honest. They're both full of stuff like

unpack 'b8', pack 'H2', substr $foo, 0, 2

which just screamed "inefficient" to me. I was sure I could reinvent this wheel better.

The naive way to do this is to unpack one's data into a bit string, insert/remove the zeroes in the eighth bit, and then pack the bits back together again:

sub packwithsubst {
    my $string = shift;
    my $bits = unpack 'b*', $string;
    $bits =~ s/(.{7})./$1/g;
    return pack 'Cb*', length($string), $bits;
}

sub unpackwithsubst {
    my ($length, $chars) = unpack 'Cb*', shift;
    $chars =~ s/(.{7})/${1}0/g;
    substr $chars, $length * 8, 8, '';
    return pack 'b*', $chars;
}
[download]

I stuck this with the CPAN code in a little script, and benchmarked it on the first 250-odd characters of "To be or not to be". And sure enough, my code was a lot faster:

               Rate GSM::SMS
GSM::SMS      324/s       --
Device::Gsm   342/s       6%
My_code       514/s      59%
[download]

But this was hardly a fair comparison; both the CPAN versions were going via a wholly unnecessary hexadecimal conversion step. What if I simply tried to improve the existing code? That is to say, used this:

my (%b2c, %c2b);
$b2c{unpack 'b8', $_} = $_ for map chr, 0..255;
$c2b{$_} = unpack 'b7', $_ for map chr, 0..127;
    
sub packdevicegsm2 { # Device::Gsm::Pdu:encode_text7, not going via he
+x
    my @char = split //, shift;
    my($result, $bits);
    map { $bits .= $c2b{$_} } @char; 
    if (my $len = length($bits) % 8) { $bits .= '0' x (8 - $len) }
    $result .= $b2c{ substr $bits, 0, 8, '' } while length $bits;
    return chr(scalar @char) . $result;
}

sub unpackdevicegsm2 { # Device::Gsm::Pdu:decode_text7, not going via 
+hex
    my $text7 = shift;
    my $len = ord substr( $text7, 0, 1, '' );
    my $bits = '';
    $bits .= unpack 'b8', substr $text7, 0, 1, '' while length $text7;
    my $decoded = '';
    $decoded .= pack 'b7', substr $bits, 0, 7, '' while length $bits >
+= 7;
    return substr $decoded, 0, $len;
}
[download]

And, to my surprise:

               Rate GSM::SMS
GSM::SMS      322/s       --
Device::Gsm   345/s       7%
My_code       519/s      61%
Dev::Gsm_opt  519/s      61%
[download]

Apparently loops and substr can be as fast as s///g. That was unexpected. And unwelcome. My pride lay in tatters. What had I achieved?!

Time to optimise.

In true hubristic style, I decided (without profiling) that the problem was probably the s///gs. They're doing a lot of copying. I could avoid them when packing by telling unpack to output only the 7 bits I was interested in; to do so I'd have to treat each byte separately, so I'd need a join, but those are fairly fast. There's no way to do that in reverse, but maybe splitting up the bits into an array would be quicker than inserting all those zeroes?

sub packwitharray {
    my $string = shift;
    return pack 'Cb*', length($string), join('', unpack '(b7)*', $stri
+ng);
}

sub unpackwitharray {
    my ($length, $bits) = unpack 'Cb*', shift;
    my @chars = $bits =~ /.{7}/g;
    $#chars = $length - 1;
    return pack '(b7)*', @chars;
}
[download]

Woo! The packing was golfed down to two lines! And it was faster, too:

               Rate GSM::SMS
GSM::SMS      321/s       --
Device::Gsm   342/s       7%
Dev::Gsm_opt  500/s      56%
Subst/Subst   519/s      62%   # This was called My_code above
Subst/Array   527/s      64%   # i.e. packwithsubst / unpackwitharray
Array/Subst   777/s     142%
Array/Array   808/s     152%
[download]

I was back in the lead! A 150% speed boost over the CPAN code was not to be sniffed at, surely? My code was awesome! It was probably faster than C!

           Rate My_code    C
My_code   792/s      -- -99%
C       77422/s   9670%   --
[download]

And I was enlightened.

Full benchmark listing, as spoiler for convenience (it's 260 lines):

#! /usr/bin/perl

use strict;
use warnings;
use Benchmark 'cmpthese';
use 5.010;

use Inline 'C';

my $sample = <<'END_STRING';
To be or not to be: that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune
Or to take arms against a sea of troubles
And by opposing, end them?
                           To die: to sleep
No more ...
END_STRING

my %packs = (
    C     => \&c_pack,
    subst => \&packwithsubst,
    array => \&packwitharray,
    gsp   => \&packgsmsms,
    dgp   => \&packdevicegsm,
    dgp2  => \&packdevicegsm2,
);

my %unpacks = (
    C     => \&c_unpack,
    subst => \&unpackwithsubst,
    array => \&unpackwitharray,
    gsp   => \&unpackgsmsms,
    dgp   => \&unpackdevicegsm,
    dgp2  => \&unpackdevicegsm2,
);

my @pairs = (
    ['gsp',   'gsp'  ],
    ['dgp',   'dgp'  ],
    ['dgp2',  'dgp2' ],
    ['subst', 'subst'],
    ['subst', 'array'],
    ['array', 'subst'],
    ['array', 'array'],
    ['C',     'C'    ],
);

# Check that all implementations work and are equivalent.
for my $impl (keys %packs) {
    next if $unpacks{array}->($packs{$impl}->($sample)) eq $sample;
    say "$impl packer is broken";
    say "Expected: ", packwithsubst("foo bar baz quux");
    say "Got:      ", $packs{$impl}->("foo bar baz quux");
    exit 1;
}
for my $impl (keys %unpacks) {
    next if $unpacks{$impl}->($packs{array}->($sample)) eq $sample;
    say "$impl unpacker is broken";
    say "Expected: foo bar baz quux";
    say "Got:      ", $unpacks{$impl}->(packwitharray("foo bar baz quu
+x"));
    exit 1;
}

# Do the benchmark.
cmpthese(-1, {
    map {
        my $p = $packs{$$_[0]};
        my $u = $unpacks{$$_[1]};
        "$$_[0]/$$_[1]" => sub { $u->($p->($sample)) };
    } @pairs
});

sub packwithsubst {
    my $string = shift;
    my $bits = unpack 'b*', $string;
    $bits =~ s/(.{7})./$1/g;
    return pack 'Cb*', length($string), $bits;
}

sub packwitharray {
    my $string = shift;
    return pack 'Cb*', length($string), join('', unpack '(b7)*', $stri
+ng);
}

sub unpackwitharray {
    my ($length, $bits) = unpack 'Cb*', shift;
    my @chars = $bits =~ /.{7}/g;
    $#chars = $length - 1;
    return pack '(b7)*', @chars;
}

sub unpackwithsubst {
    my ($length, $chars) = unpack 'Cb*', shift;
    $chars =~ s/(.{7})/${1}0/g;
    substr $chars, $length * 8, 8, '';
    return pack 'b*', $chars;
}

sub unpackdevicegsm2 { # Device::Gsm::Pdu:decode_text7, not going via 
+hex
    my $text7 = shift;
    my $len = ord substr( $text7, 0, 1, '' );
    my $bits = '';
    while( length $text7 ) {
        $bits .= unpack 'b8', substr $text7, 0, 1, '';
    }
    my $decoded = '';
    while( length $bits >= 7 ) {
        $decoded .= pack 'b7', substr($bits, 0, 7, '');
    }
    return substr $decoded, 0, $len;
}

BEGIN {
    
my (%b2c, %c2b);
$b2c{unpack 'b8', $_} = $_ for map chr, 0..255;
$c2b{$_} = unpack 'b7', $_ for map chr, 0..127;
    
sub packdevicegsm2 { # Device::Gsm::Pdu:encode_text7, not going via he
+x
    my @char = split //, shift;
    my($result, $bits);
    map { $bits .= $c2b{$_} } @char; 
    if( my $len = length($bits) % 8 ) {
        $bits .= '0' x ( 8 - $len );
    }
    while( length $bits ) {
        $result .= $b2c{ substr $bits, 0, 8, '' };
    }
    return chr(scalar @char) . $result;
}
}

sub unpackdevicegsm { # Device::Gsm::Pdu:decode_text7, cosmetic change
+s only
    my $text7 = unpack 'H*', shift;
    my $len = hex substr( $text7, 0, 2 );
    $text7  = substr $text7, 2;
    my $bits = '';
    while( $text7 ) {
        $bits .= unpack 'b8', pack 'H2', substr $text7, 0, 2;
        if( length($text7) > 2 ) {
            $text7 = substr($text7, 2);
        } else {
            $text7 = '';
        }
    }
    my $decoded;
    while( length $bits >= 7 ) {
        $decoded .= pack 'b7', substr($bits, 0, 7);
        $bits = substr $bits, 7;
    }
    return substr $decoded, 0, $len;
}

BEGIN {
    
my (%b2h, %h2b);
$b2h{unpack 'b8', $_} = unpack 'H2', $_ for map chr, 0..255;
$h2b{$_} = unpack 'b7', $_ for map chr, 0..127;
    
sub packdevicegsm { # Device::Gsm::Pdu:encode_text7, cosmetic changes 
+only
    my($result, $bits);
    my @char = split //, $_[0];
    map { $bits .= $h2b{$_} } @char; 
    if( my $len = length($bits) % 8 ) {
        $bits .= '0' x ( 8 - $len );
    }
    while( length $bits ) {
        $result .= $b2h{ substr $bits, 0, 8 };
        $bits = substr $bits, 8;
    }
    return chr(scalar @char) . pack 'H*', $result;
}
}

sub unpackgsmsms { # Based on GSM::SMS::PDU::decode_7bit
    my ($ud) = @_;
    my ($msg,$bits);
    my $cnt = 0;
    my $len = ord substr $ud, 0, 1, '';
    $msg = "";
    my $byte = unpack('b8', substr($ud, 0, 1));
    while (($cnt<length($ud)) && (length($msg)<$len)) {
        $msg .= pack('b7', $byte);
        $byte = substr($byte, 7, length($byte) - 7);
        if ( (length( $byte ) < 7) ) {
            ++$cnt; 
            $byte = $byte.unpack('b8', substr($ud, $cnt, 1));
        }
    }
    return $msg;
}

sub packgsmsms { # Based on GSM::SMS::PDU::encode_7bit
    my ($msg) = @_;
    my ($bits, $ud, $octet);
    foreach (split(//,$msg)) {
        $bits .= unpack('b7', $_);
    }
    while (defined($bits) && (length($bits)>0)) {
        $octet = substr($bits,0,8);
        $ud .= pack("b8", substr($octet."0" x 7, 0, 8));
        $bits = (length($bits) > 8) ? substr($bits, 8) : "";
    }
    return chr(length $msg) . $ud;
}

__DATA__
__C__

#include <stdlib.h>
typedef unsigned char uchr;

SV* c_pack(SV* input)
{
    SV*   rv;
    int   len, ipos;
    uchr *bytes  = (uchr*) SvPV(input, len);
    int   newlen = ceil((len / 8.) * 7) + 1;
    uchr *packed = (uchr*) calloc(newlen, 1);
    int   opos = 1, oset = 0;
    packed[0] = len;
    for (ipos = 0; ipos < len; ++ipos) {
        uchr c1 = bytes[ipos];
        uchr c2 = bytes[ipos + 1];
        if (c1 & 0x80) croak("Data must be 7-bit");
        packed[opos++] = c1 >> oset | c2 << 7 - oset;
        if (++oset == 7) { oset = 0; ++ipos; }
    }
    rv = newSVpv((char*) packed, newlen);
    free(packed);
    return rv;
}

SV* c_unpack(SV* input)
{
    SV*   rv;
    int   len;
    uchr *bytes    = (uchr*) SvPV(input, len);
    int   newlen   = bytes[0];
    uchr *unpacked = (uchr*) calloc(newlen, 1);
    int   ipos, opos = 0, oset = 0;
    uchr  carry = 0;
    for (ipos = 1; ipos < len; ++ipos) {
        uchr ch = bytes[ipos];
        unpacked[opos++] = carry | (ch & (0xff >> oset + 1)) << oset;
        if (oset == 6) {
            unpacked[opos++] = ch >> 1;
            carry = oset = 0;
        }
        else {
            carry = ch >> 7 - oset++;
        }            
    }
    rv = newSVpv((char*) unpacked, newlen);
    free(unpacked);
    return rv;
}
[download]

Comment on Sweating the small stuff: a lesson in optimisation Select or Download Code

Replies are listed 'Best First'.
Re: Sweating the small stuff: a lesson in optimisation by RMGir (Prior) on Mar 16, 2009 at 13:08 UTC
Very entertaining read. I hope everyone reads all the way to the end - the last benchmark cracked me up. The only question you didn't mention was "how fast is fast enough for this application?" If the packing/unpacking time is not a bottleneck using My_code, then you might not need the C version, removing the need for Inline::C. Mike	[reply]
Re^2: Sweating the small stuff: a lesson in optimisation by Porculus (Hermit) on Mar 16, 2009 at 21:40 UTC
That was part of the lesson, really. In practice any of the implementations would be "fast enough". The speed of this function does make a measurable difference to the run-time, but measurable isn't the same as significant; the entire program currently spends maybe two minutes a day running, so even if that was all spent in these routines, it still wouldn't make much difference. (I didn't profile before I started; I went straight to optimisation mode because this particular routine involved Other People's Code and it Looked Inefficient. Hubris again.) I wrote the final version purely out of curiosity, and it put things so firmly in perspective that I just had to write it up.	[reply]
Re: Sweating the small stuff: a lesson in optimisation by Limbic~Region (Chancellor) on Mar 16, 2009 at 15:06 UTC
Porculus, You may find Efficient 7bit compression amusing. Cheers - L~R	[reply]
Re^2: Sweating the small stuff: a lesson in optimisation by Porculus (Hermit) on Mar 16, 2009 at 22:33 UTC
Indeed I do! Thanks for the link. It's fun to see how differently the implementations evolve when the goal is to optimise for footprint over speed. I guess I've supplied the missing runtime-optimised versions, just a few years late. :) (I also observe in passing that the implementations in that thread are all buggy in the general case. They don't take into account that "foo bar" and "foo bar\x00" will pack to identical strings. Not a problem if you know there are no nulls in the input, but that's why the length byte in my data gives the number of packed characters rather than simply the length of the field.)	[reply]