BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

The input is a byte stream. The required output is a list or array of integers decoded from it.

The highest two bits in first byte of each integer value in the input stream decide how long the value is:

  1. If the highest bit of the first byte is 0;

    then the output is the numeric value of the other 7-bits in that byte.

    01010101 == 85
  2. If the highest bit is 1; and the second highest bit is 0;

    The output is the numeric value of lowest 14-bits of the first and second bytes treated as a big-endian short.

    10101010_10101010 == 10922
  3. If the highest two bits of the first byte are 1;

    Then the output is the numeric value of the lowest 22-bits of the first three bytes of input (big-endian).

    11001100_11001100_11001100 == 838860

Thoughts?


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re: [NOT] How would you decode this?
by Corion (Patriarch) on Dec 28, 2010 at 15:21 UTC

    I think my approach would reflect your description quite closely, assuming that you have the data as an array of scalars already:

    #!perl -w use strict; use Carp qw(croak); use Data::Dumper; sub get_value { my ($values) = @_; my $result = shift @$values; if (($result & 0b10000000) == 0) { # maybe $result < 127 is faster? return $result } elsif (($result & 0b11000000) == 0b10000000) { # maybe $result < 0b11000000 is faster? return (($result & 0b00111111) << 8) + shift @$values; } elsif (($result & 0b11000000) == 0b11000000) { # Assuming that Perl evaluates even "commutative" expressions # from left to right $result = (($result & 0b00111111) << 16) + ((shift @$values) << 8) + ((shift @$values)); return $result; } else { # This should not happen anyway: croak sprintf "Invalid value found in input: %08b", $result; }; }; my @values = (0b01010101, # 85 0b10101010, 0b10101010, # 10922 0b11001100, 0b11001100, 0b11001100, # 838860 0b00000000, # 0 as another simplicistic testcase ); while (@values) { print "$values[0] => ", get_value( \@values ),"\n"; };

    As you have far more experience in benchmarking things, my idea of changing the bit-equality operations to size comparisons remains unimplemented.

    If you have the data not as a stream of integers but as a string, I would use regular expressions to extract the information:

    #!perl -w use strict; use Carp qw(croak); use Data::Dumper; sub get_value { my ($values) = @_; if ($$values =~ s/^([\x00-\x7F])//) { return ord $1 } elsif ($$values =~ s/^([\x80-\xBF])(.)//) { return (((ord($1) & 0b00111111)<< 8) + ord($2)); } elsif ($$values =~ s/^([\xC0-\xFF])(.)(.)//) { return ( ((ord($1) & 0b00111111) << 16) +(ord($2) << 8) +(ord($3)) ); } else { croak "Invalid wide character in input: %08b", substr $$values +,0,1; }; }; my @values = (0b01010101, # 85 0b10101010, 0b10101010, # 10922 0b11001100, 0b11001100, 0b11001100, # 838860 0b00000000, # 0 as another simplicistic testcase ); my $values = join "", map { chr } @values; while (length $values) { print get_value( \$values ),"\n"; };

    Most likely faster is a single invocation of the RE engine with alternation, but that makes decoding what we've got a bit uglier. I left-pad the string with zeroes to 32bit and then unpack it as a number, masking off the relevant bits afterwards:

    #!perl -w use strict; use Carp qw(croak); use Data::Dumper; sub get_value { my ($values) = @_; if ($$values =~ s/^([\x00-\x7F]|[\x80-\xBF].|[\xC0-\xFF]..)//) { my $tmp = substr "\0\0\0$1", -4; # maybe sprintf or pack would + be faster... my $result = unpack 'N', $tmp; if (length $1 == 3) { $result &= 0b00111111_11111111_11111111; } elsif (length $1 == 2) { $result &= 0b00111111_11111111; }; return $result; } else { croak "Invalid wide character in input: %08b", substr $$values +,0,1; }; }; my @values = (0b01010101, # 85 0b10101010, 0b10101010, # 10922 0b11001100, 0b11001100, 0b11001100, # 838860 0b00000000, # 0 as another simplicistic testcase ); my $values = join "", map { chr } @values; while (length $values) { print get_value( \$values ),"\n"; };
Re: How would you decode this? (3 steps)
by tye (Sage) on Dec 29, 2010 at 00:44 UTC

    Same way in Perl or C:

    my $c= getc(); return $c if $c < 0x80; $c= getc() + (($c-0x80)<<8); return $c if $c < 0x4000 return getc() + (($c-0x4000)<<8);

    In Perl, getc() would be something like:

    sub getc { ord( substr($buf,0,1,'') ); }

    In C, I'd probably consume the buffer building a normalized buffer that I'd read from Perl via something like unpack "L*" (and thus mostly not deal with Perl data structures in C code). And I don't think I'd sweat the few extra arithmetic operations by resorting to switch(c>>6) { case 0: case 1: return c; case 2: return ...; case 3: return ...; }

    - tye        

Re: [NOT] How would you decode this?
by jethro (Monsignor) on Dec 28, 2010 at 15:27 UTC

    Really fast => C

    If pure perl it might make sense to decode the string in all possible variations, ie:

    @singlebyte= vec $string,0,8; @twobyte0= vec $string,0,16; @twobyte1= vec $string,8,16; ... @threebyte2= vec $string,16,24;

    and use appropriate offsets into these arrays (or pad the arrays so that you can use one offset for all).

    If there is some bias in the input so that most bytes are coded with one of the 3 methods, for example if 98% of all integers were encoded with 3 bytes, even better. You might get away with only one, two, or three of these vec calls and do the decoding of the other variants slowly without impacting the overall performance.

    There is also a final and-ing neccessary to eliminate the two high bits for the 2-byte and three-byte case, but I'm sure you already thought about this

    UPDATE: Not only is vec not possible with 24 bits, it also extracts only one value per call instead of handling the whole string. I should learn reading again. Without that the time savings are probably minimal to nonexistant. So something like unpack "n*" and unpack "cn*" for the 16bit values would be more appropriate. Still doesn't solve the 24-bit case

      One problem is that vec doesn't handle 24-bit values.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Oh, then the perl man page I consulted is deficient: perldoc -f vec says "This must be a power of two from 1 to 32". But a quick test shows that you are right
Re: [NOT] How would you decode this?
by talexb (Chancellor) on Dec 28, 2010 at 15:46 UTC

    This truly sounds like it's something that would be better suited to a solution in C, rather than in Perl.

    Of course, your choice may vary depending on operating platform, desired performance, development cost, volume of data and so forth.

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: [NOT] How would you decode this?
by juster (Friar) on Dec 28, 2010 at 17:22 UTC
    Here is my quick stab at this in perl. It seems convenient to use the highest two bits in the first byte as a count. Of course this probably isn't super fast.
    #!/usr/bin/env perl use warnings; use strict; sub conv_stream { my ($stream_ref) = @_; my $first = shift @$stream_ref; my $count = ( vec $first, 3, 2 ) || 1; # edit: I was using unpack and oct above # edit: No need to overwrite the "count" if first bits are 00 or 0 +1 if ( $count > 1 ) { ( vec $first, 3, 2 ) = 0; } my $next = join q{}, splice @$stream_ref, 0, ($count-1); my $pad = (chr 0) x (4-$count); return unpack 'N', $pad . $first . $next; } # I'm not quite sure I have the right bit ordering... my @stream = map { pack 'B8', $_ } ( '01010101', # 85 '10101010', '10101010', # 10922 '11001100', '11001100', '11001100' ); # 838860 print conv_stream( \@stream ), "\n" while ( @stream );
Re: [NOT] How would you decode this?
by roboticus (Chancellor) on Dec 29, 2010 at 00:47 UTC

    I just read the thread, and someone mentioned C, so here goes:

    $ cat 879430.cpp #include <stdio.h> int *decode(unsigned char *buf, int buflen) { static int retval[201]; int tmp=0; int mask=0; int *outbuf=retval; unsigned char *end=buf+buflen; while (buf < end) { int t=*buf & 0xc0; if (t==0xc0) { tmp =*buf++<<16; tmp|=*buf++<<8; tmp|=*buf++; mask=0x3fffff; } else if (t==0x80) { tmp =*buf++<<8; tmp|=*buf++; mask=0x3fff; } else { tmp =*buf++; mask=0x7f; } *outbuf++ = tmp&mask; } *outbuf++ = -1; // sentinel value: EOList return retval; } unsigned char foo[] = { 0x55, // 85 0xaa, 0xaa, // 10922 0xcc, 0xcc, 0xcc, // 838860 0 }; int main( void ) { int *t = decode(foo, sizeof(foo)); while (*t != -1) { printf("0x%06x %7u\n", *t, *t); t++; } return 1; } $ gcc 879430.cpp -lstdc++ $ ./a.out 0x000055 85 0x002aaa 10922 0x0ccccc 838860 0x000000 0

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: [NOT] How would you decode this?
by johngg (Canon) on Dec 28, 2010 at 23:51 UTC

    I've used getc to read the byte stream and vec to test the first and second bits of the byte. Depending on the result of the tests I either just unpack an unsigned char or append one or two more getc() operations, mask off the first two bits and then either unpack() a network short (2 byte piece) or a left-padded with 8 0-bits network long (3 byte piece). I have not incorporated error checking for incomplete reads.

    use strict; use warnings; use 5.010; my $mask2 = pack q{CC}, 0b00111111, 0b11111111; my $mask3 = pack q{CCC}, 0b00111111, 0b11111111, 0b11111111; my @packed = ( pack( q{C}, 0b01010101 ), pack( q{CC}, 0b10101010, 0b10101010 ), pack( q{CCC}, 0b11001100, 0b11001100, 0b11001100 ), ); my $byteStream = ( join q{}, @packed ) x 3; open my $byteStreamFH, q{<}, \ $byteStream or die $!; while ( defined( my $piece = getc( $byteStreamFH ) ) ) { unless ( vec $piece, 7, 1 ) { # 1 byte piece # say unpack q{C}, $piece; } elsif ( vec $piece, 6, 1 ) { # 3 byte piece # $piece .= getc( $byteStreamFH ) for 1 .. 2; $piece &= $mask3; say unpack q{N}, pack( q{C}, 0 ) . $piece; } else { # 2 byte piece # $piece .= getc( $byteStreamFH ); $piece &= $mask2; say unpack q{n}, $piece; } } close $byteStreamFH or die $!;

    The output

    85 10922 838860 85 10922 838860 85 10922 838860

    I don't know how quick this would be on a large volume of data but I would imagine that a solution in C would be somewhat faster.

    I hope this is of interest.

    Cheers,

    JohnGG

Re: [BUK] How would you decode this?
by LanX (Saint) on Dec 28, 2010 at 16:11 UTC
    > Thoughts?

    In Perl this should be quite easily and quickly done with a state-machine operating on lookup tables per byte...

    Well as long as the [NOT] in the title wasn't meant to flag that it's not a Perl question!(?)

    Cheers Rolf

      Well as long as the NOT in the title wasn't meant to flag that it's not a Perl question!(?)

      I believe the OP meant "Not OT".

      HTH,

      planetscape