[NOT] How would you decode this?

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: [NOT] How would you decode this? by Corion (Patriarch) on Dec 28, 2010 at 15:21 UTC
I think my approach would reflect your description quite closely, assuming that you have the data as an array of scalars already: #!perl -w use strict; use Carp qw(croak); use Data::Dumper; sub get_value { my ($values) = @_; my $result = shift @$values; if (($result & 0b10000000) == 0) { # maybe $result < 127 is faster? return $result } elsif (($result & 0b11000000) == 0b10000000) { # maybe $result < 0b11000000 is faster? return (($result & 0b00111111) << 8) + shift @$values; } elsif (($result & 0b11000000) == 0b11000000) { # Assuming that Perl evaluates even "commutative" expressions # from left to right $result = (($result & 0b00111111) << 16) + ((shift @$values) << 8) + ((shift @$values)); return $result; } else { # This should not happen anyway: croak sprintf "Invalid value found in input: %08b", $result; }; }; my @values = (0b01010101, # 85 0b10101010, 0b10101010, # 10922 0b11001100, 0b11001100, 0b11001100, # 838860 0b00000000, # 0 as another simplicistic testcase ); while (@values) { print "$values[0] => ", get_value( \@values ),"\n"; }; [download] As you have far more experience in benchmarking things, my idea of changing the bit-equality operations to size comparisons remains unimplemented. If you have the data not as a stream of integers but as a string, I would use regular expressions to extract the information: #!perl -w use strict; use Carp qw(croak); use Data::Dumper; sub get_value { my ($values) = @_; if ($$values =~ s/^([\x00-\x7F])//) { return ord $1 } elsif ($$values =~ s/^([\x80-\xBF])(.)//) { return (((ord($1) & 0b00111111)<< 8) + ord($2)); } elsif ($$values =~ s/^([\xC0-\xFF])(.)(.)//) { return ( ((ord($1) & 0b00111111) << 16) +(ord($2) << 8) +(ord($3)) ); } else { croak "Invalid wide character in input: %08b", substr $$values +,0,1; }; }; my @values = (0b01010101, # 85 0b10101010, 0b10101010, # 10922 0b11001100, 0b11001100, 0b11001100, # 838860 0b00000000, # 0 as another simplicistic testcase ); my $values = join "", map { chr } @values; while (length $values) { print get_value( \$values ),"\n"; }; [download] Most likely faster is a single invocation of the RE engine with alternation, but that makes decoding what we've got a bit uglier. I left-pad the string with zeroes to 32bit and then unpack it as a number, masking off the relevant bits afterwards: #!perl -w use strict; use Carp qw(croak); use Data::Dumper; sub get_value { my ($values) = @_; if ($$values =~ s/^([\x00-\x7F]\|[\x80-\xBF].\|[\xC0-\xFF]..)//) { my $tmp = substr "\0\0\0$1", -4; # maybe sprintf or pack would + be faster... my $result = unpack 'N', $tmp; if (length $1 == 3) { $result &= 0b00111111_11111111_11111111; } elsif (length $1 == 2) { $result &= 0b00111111_11111111; }; return $result; } else { croak "Invalid wide character in input: %08b", substr $$values +,0,1; }; }; my @values = (0b01010101, # 85 0b10101010, 0b10101010, # 10922 0b11001100, 0b11001100, 0b11001100, # 838860 0b00000000, # 0 as another simplicistic testcase ); my $values = join "", map { chr } @values; while (length $values) { print get_value( \$values ),"\n"; }; [download]	[reply] [d/l] [select]
Re: How would you decode this? (3 steps) by tye (Sage) on Dec 29, 2010 at 00:44 UTC
Same way in Perl or C: `my $c= getc(); return $c if $c < 0x80; $c= getc() + (($c-0x80)<<8); return $c if $c < 0x4000 return getc() + (($c-0x4000)<<8);` [download] In Perl, getc() would be something like: `sub getc { ord( substr($buf,0,1,'') ); }` [download] In C, I'd probably consume the buffer building a normalized buffer that I'd read from Perl via something like `unpack "L*"` (and thus mostly not deal with Perl data structures in C code). And I don't think I'd sweat the few extra arithmetic operations by resorting to `switch(c>>6) { case 0: case 1: return c; case 2: return ...; case 3: return ...; }` - tye	[reply] [d/l] [select]
Re: [NOT] How would you decode this? by jethro (Monsignor) on Dec 28, 2010 at 15:27 UTC
Really fast => C If pure perl it might make sense to decode the string in all possible variations, ie: `@singlebyte= vec $string,0,8; @twobyte0= vec $string,0,16; @twobyte1= vec $string,8,16; ... @threebyte2= vec $string,16,24;` [download] and use appropriate offsets into these arrays (or pad the arrays so that you can use one offset for all). If there is some bias in the input so that most bytes are coded with one of the 3 methods, for example if 98% of all integers were encoded with 3 bytes, even better. You might get away with only one, two, or three of these vec calls and do the decoding of the other variants slowly without impacting the overall performance. There is also a final and-ing neccessary to eliminate the two high bits for the 2-byte and three-byte case, but I'm sure you already thought about this UPDATE: Not only is vec not possible with 24 bits, it also extracts only one value per call instead of handling the whole string. I should learn reading again. Without that the time savings are probably minimal to nonexistant. So something like unpack "n" and unpack "cn" for the 16bit values would be more appropriate. Still doesn't solve the 24-bit case	[reply] [d/l]
Re^2: [NOT] How would you decode this? by BrowserUk (Patriarch) on Dec 28, 2010 at 15:48 UTC
One problem is that vec doesn't handle 24-bit values. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]
Re^3: [NOT] How would you decode this? by jethro (Monsignor) on Dec 28, 2010 at 16:10 UTC
Oh, then the perl man page I consulted is deficient: perldoc -f vec says "This must be a power of two from 1 to 32". But a quick test shows that you are right	[reply]
Re^4: [NOT] How would you decode this? by Corion (Patriarch) on Dec 28, 2010 at 16:11 UTC
Re^5: [NOT] How would you decode this? by jethro (Monsignor) on Dec 28, 2010 at 16:13 UTC
Some notes below your chosen depth have not been shown here
Re^4: [NOT] How would you decode this? by BrowserUk (Patriarch) on Dec 28, 2010 at 16:11 UTC
Re: [NOT] How would you decode this? by talexb (Chancellor) on Dec 28, 2010 at 15:46 UTC
This truly sounds like it's something that would be better suited to a solution in C, rather than in Perl. Of course, your choice may vary depending on operating platform, desired performance, development cost, volume of data and so forth. Alex / talexb / Toronto "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds	[reply]
Re: [NOT] How would you decode this? by juster (Friar) on Dec 28, 2010 at 17:22 UTC
Here is my quick stab at this in perl. It seems convenient to use the highest two bits in the first byte as a count. Of course this probably isn't super fast. #!/usr/bin/env perl use warnings; use strict; sub conv_stream { my ($stream_ref) = @_; my $first = shift @$stream_ref; my $count = ( vec $first, 3, 2 ) \|\| 1; # edit: I was using unpack and oct above # edit: No need to overwrite the "count" if first bits are 00 or 0 +1 if ( $count > 1 ) { ( vec $first, 3, 2 ) = 0; } my $next = join q{}, splice @$stream_ref, 0, ($count-1); my $pad = (chr 0) x (4-$count); return unpack 'N', $pad . $first . $next; } # I'm not quite sure I have the right bit ordering... my @stream = map { pack 'B8', $_ } ( '01010101', # 85 '10101010', '10101010', # 10922 '11001100', '11001100', '11001100' ); # 838860 print conv_stream( \@stream ), "\n" while ( @stream ); [download]	[reply] [d/l]
Re: [NOT] How would you decode this? by roboticus (Chancellor) on Dec 29, 2010 at 00:47 UTC
I just read the thread, and someone mentioned C, so here goes: $ cat 879430.cpp #include <stdio.h> int decode(unsigned char buf, int buflen) { static int retval[201]; int tmp=0; int mask=0; int outbuf=retval; unsigned char end=buf+buflen; while (buf < end) { int t=buf & 0xc0; if (t==0xc0) { tmp =buf++<<16; tmp\|=buf++<<8; tmp\|=buf++; mask=0x3fffff; } else if (t==0x80) { tmp =buf++<<8; tmp\|=buf++; mask=0x3fff; } else { tmp =buf++; mask=0x7f; } outbuf++ = tmp&mask; } outbuf++ = -1; // sentinel value: EOList return retval; } unsigned char foo[] = { 0x55, // 85 0xaa, 0xaa, // 10922 0xcc, 0xcc, 0xcc, // 838860 0 }; int main( void ) { int t = decode(foo, sizeof(foo)); while (t != -1) { printf("0x%06x %7u\n", t, t); t++; } return 1; } $ gcc 879430.cpp -lstdc++ $ ./a.out 0x000055 85 0x002aaa 10922 0x0ccccc 838860 0x000000 0 [download] ...roboticus When your only tool is a hammer, all problems look like your thumb.*	[reply] [d/l]
Re: [NOT] How would you decode this? by johngg (Canon) on Dec 28, 2010 at 23:51 UTC
I've used getc to read the byte stream and vec to test the first and second bits of the byte. Depending on the result of the tests I either just unpack an unsigned char or append one or two more `getc()` operations, mask off the first two bits and then either `unpack()` a network short (2 byte piece) or a left-padded with 8 0-bits network long (3 byte piece). I have not incorporated error checking for incomplete reads. use strict; use warnings; use 5.010; my $mask2 = pack q{CC}, 0b00111111, 0b11111111; my $mask3 = pack q{CCC}, 0b00111111, 0b11111111, 0b11111111; my @packed = ( pack( q{C}, 0b01010101 ), pack( q{CC}, 0b10101010, 0b10101010 ), pack( q{CCC}, 0b11001100, 0b11001100, 0b11001100 ), ); my $byteStream = ( join q{}, @packed ) x 3; open my $byteStreamFH, q{<}, \ $byteStream or die $!; while ( defined( my $piece = getc( $byteStreamFH ) ) ) { unless ( vec $piece, 7, 1 ) { # 1 byte piece # say unpack q{C}, $piece; } elsif ( vec $piece, 6, 1 ) { # 3 byte piece # $piece .= getc( $byteStreamFH ) for 1 .. 2; $piece &= $mask3; say unpack q{N}, pack( q{C}, 0 ) . $piece; } else { # 2 byte piece # $piece .= getc( $byteStreamFH ); $piece &= $mask2; say unpack q{n}, $piece; } } close $byteStreamFH or die $!; [download] The output `85 10922 838860 85 10922 838860 85 10922 838860` [download] I don't know how quick this would be on a large volume of data but I would imagine that a solution in C would be somewhat faster. I hope this is of interest. Cheers, JohnGG	[reply] [d/l] [select]
Re: [BUK] How would you decode this? by LanX (Saint) on Dec 28, 2010 at 16:11 UTC
> Thoughts? In Perl this should be quite easily and quickly done with a state-machine operating on lookup tables per byte... Well as long as the `[NOT]` in the title wasn't meant to flag that it's not a Perl question!(?) Cheers Rolf	[reply] [d/l]
Re^2: [BUK] How would you decode this? by planetscape (Chancellor) on Dec 29, 2010 at 12:12 UTC
Well as long as the NOT in the title wasn't meant to flag that it's not a Perl question!(?) I believe the OP meant "Not OT". HTH, planetscape	[reply]