pack unpack charcount repetition

denthijs has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.

Re: pack unpack charcount repetition
by BrowserUk (Patriarch) on Jun 09, 2003 at 11:34 UTC

Update: Removed errant print from code, as pointed out by cciulla below.

Also, this was intended as a slightly tongue-in-cheek answer in response to the (golf) nomenclature in the (original) title. I apologise for omitting the smiley. See my addendum below by way of recompense.


How about 10:1 compression?

use Devel::Size qw[total_size];

my @a = ( 0,54,28,76,126,0,28,54,62,54,0,28,54,60,48,0,54,54,62,54,0 );

my $packed = pack 'C*', @a; 

print 'Memory requirement of ', total_size(\@a)
    , ' bytes is reduced to ', total_size(\$packed), ' bytes';

Memory requirement of 476 bytes is reduced to 46 bytes.

Addendum

There are two basic ways of compressing character data.

Bitwise reduction (there's probably a better term). In this, you reduce the storage requirement by using fewer than 8 bits per character. For example, if you only needed to represent uppercase alpha, you could get away with 5 bits/char, so you could get a 3/8ths reduction by packing them into a bitstream.
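As a rough illustration of the 5 bits/char idea, here is a minimal sketch. The names pack5 and unpack5 are made up for this example, and it assumes the input contains only A-Z.

```perl
use strict;
use warnings;

# Map 'A'..'Z' to 0..25 and emit each as a 5-bit field; pack 'B*' pads the
# final byte with zero bits.
sub pack5 {
    my ($text) = @_;
    my $bits = join '', map { sprintf '%05b', ord($_) - ord('A') } split //, $text;
    return pack 'B*', $bits;
}

# Reverse the mapping. The caller supplies the character count because the
# padding bits cannot be told apart from data.
sub unpack5 {
    my ($packed, $count) = @_;
    my $bits = unpack 'B*', $packed;
    return join '',
        map { chr( ord('A') + oct( '0b' . substr( $bits, 5 * $_, 5 ) ) ) } 0 .. $count - 1;
}

my $text   = 'JUSTANOTHERPERLHACKER';
my $packed = pack5($text);
printf "%d chars packed into %d bytes\n", length $text, length $packed;
```

The 21 characters need 105 bits, which rounds up to 14 bytes.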

But as you have 21 bytes and a range of values below 128, you would at best be saving 1 bit per byte. 21 bits saves 2 bytes! Hardly worth the effort.
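To put a number on that, a small sketch that packs the 21 values at 7 bits each. Going through a bit string and pack 'B*' is just one way to do it, picked here for brevity.

```perl
use strict;
use warnings;

my @a = ( 0,54,28,76,126,0,28,54,62,54,0,28,54,60,48,0,54,54,62,54,0 );

# Every value is below 128, so 7 bits each suffice.
my $bits   = join '', map { sprintf '%07b', $_ } @a;
my $packed = pack 'B*', $bits;      # 147 bits, padded to 152

printf "%d bytes as 'C*', %d bytes at 7 bits each\n", scalar @a, length $packed;

# Decode: keep the 147 data bits and split them into 7-bit fields.
my @back = map { oct("0b$_") } substr( unpack( 'B*', $packed ), 0, 7 * @a ) =~ /(.{7})/g;
```

That comes to 19 bytes instead of 21, which is the 2-byte saving mentioned above.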

Then there is the dictionary method that you tried yourself. In this, you build a dictionary of the common bytes (actually, strings of common byte sequences work better, but your sample data is too short and varied for this to work well), and then represent the bytes by indexes into the dictionary. The problem, as you saw, is that representing 1 byte by another, plus the dictionary, makes it worse rather than better. However, if you then use the first technique to reduce the storage requirement of the indices, then you get somewhere.

Your dictionary has 8 entries [0,28,48,54,60,62,76,126] (which is unfortunate. If it were only 7, the compression would be greater). That means you need 4 bits/byte for the indices.
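The dictionary and the index list can be derived mechanically. A minimal sketch, assuming the dictionary is simply the distinct values in ascending order (the variable names are only illustrative):

```perl
use strict;
use warnings;

my @data = ( 0,54,28,76,126,0,28,54,62,54,0,28,54,60,48,0,54,54,62,54,0 );

# Dictionary: the distinct values, sorted numerically.
my %seen;
my @dict = sort { $a <=> $b } grep { !$seen{$_}++ } @data;

# Position of each dictionary entry, then the data rewritten as indices.
my %index_of = map { $dict[$_] => $_ } 0 .. $#dict;
my @ind      = map { $index_of{$_} } @data;

print "dictionary: @dict\n";
print "indices:    @ind\n";
```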

Update 2: I was having a bad day. 8 values can be indexed with 3 bits! So 21*3/8 gives 8 bytes, not 11, so a total of 16 bytes is all that is required. In addition, 1 bit per byte of the dictionary could be shed, saving another byte.

my @ind = (0,3,1,6,7,0,1,3,5,3,0,1,3,4,2,0,3,3,5,3,0);
my $ind = '';
vec( $ind, $_, 4) = $ind[$_] for 0 .. $#ind;

print length $ind;    # prints 11

This has allowed you to pack the 21 indices into 11 bytes. But now you need to concatenate that with the dictionary of 8 bytes, and you are back to 19 bytes!
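The 3-bit figure from Update 2 can be checked in a similar way. vec only accepts element widths that are powers of two, so this sketch goes through a bit string and pack 'B*' instead. That choice is an assumption of the example, not something from the original code.

```perl
use strict;
use warnings;

my @dict = ( 0,28,48,54,60,62,76,126 );
my @ind  = ( 0,3,1,6,7,0,1,3,5,3,0,1,3,4,2,0,3,3,5,3,0 );

# Pack the indices at 3 bits each: 63 bits, padded to 8 bytes.
my $ind = pack 'B*', join '', map { sprintf '%03b', $_ } @ind;

# The dictionary still costs one byte per entry.
my $dict = pack 'C*', @dict;

printf "%d index bytes + %d dictionary bytes = %d bytes\n",
    length($ind), length($dict), length($ind) + length($dict);

# Decode: read back 3-bit fields and look each one up in the dictionary.
my $bits = substr( unpack( 'B*', $ind ), 0, 3 * @ind );
my @back = map { $dict[ oct("0b$_") ] } $bits =~ /(.{3})/g;
print join( ',', @back ), "\n";
```

This gives 8 + 8 = 16 bytes, against 21 for the plain 'C*' string.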

If your data contains any common sequences of bytes, then you can store multi-byte sequences in the dictionary and represent them with a single index, and possibly get a greater saving. A cursory inspection shows two such sequences of two repetitions each: 0, 54 and 54, 0 both appear twice. That means you could reduce the number of indices by 2, saving 1 byte. But the dictionary would have to grow by 4 bytes to do it, so that doesn't help much either.
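Rather than inspecting by eye, repeated two-byte sequences can be counted. A quick sketch that only looks at adjacent pairs (longer sequences would need a different approach):

```perl
use strict;
use warnings;

my @data = ( 0,54,28,76,126,0,28,54,62,54,0,28,54,60,48,0,54,54,62,54,0 );

# Count every adjacent pair of values, keyed as "first,second".
my %count;
$count{ join ',', @data[ $_, $_ + 1 ] }++ for 0 .. $#data - 1;

# List the pairs that occur more than once, most frequent first.
for my $pair ( sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count ) {
    next if $count{$pair} < 2;
    print "$pair occurs $count{$pair} times\n";
}
```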

If your range of chars were smaller, or there were more data and a greater chance of common sequences, then you might get better results. Unfortunately, your dataset is such that it could almost have been purposely chosen to be incompressible :)

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


Re: Re: pack unpack charcount repetition

by cciulla (Friar) on Jun 09, 2003 at 12:15 UTC

++BrowserUK.

However, the print statement in

my $packed = print pack 'C*', @a;

is errant; it should read:

my $packed = pack 'C*', @a;

my @a = ( 0,54,28,76,126,0,28,54,62,54,0,28,54,60,48,0,54,54,62,54,0 );
my $packed = pack 'C*', @a;
my (@unpacked) = unpack 'C*', $packed;

print "$packed\n";
print join ',', @unpacked;

C²


Re: Re: pack unpack charcount repetition

by denthijs (Acolyte) on Jun 09, 2003 at 12:17 UTC

yes!

(0,54,28,76,126,0,28,54,62,54,0,28,54,60,48,0,54,54,62,54,0);
unpack'C*',unpack'u*',"5`#8<3'X`'#8^-@`<-CPP`#8V/C8`";
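For anyone wondering where such a string comes from: it looks like the output of pack's 'u' (uuencode) template applied to the packed bytes. A small sketch under that assumption, which also decodes the literal from the post:

```perl
use strict;
use warnings;

my @a = ( 0,54,28,76,126,0,28,54,62,54,0,28,54,60,48,0,54,54,62,54,0 );

# Encode: bytes first, then uuencode the byte string.
my $uu = pack 'u', pack 'C*', @a;
chomp $uu;                          # pack 'u' appends a newline
print "$uu\n";

# Decode the literal from the post and compare it with the original list.
my @back = unpack 'C*', unpack 'u', q{5`#8<3'X`'#8^-@`<-CPP`#8V/C8`};
print join( ',', @back ), "\n";
```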


Re: Re: pack unpack charcount repetition

by zby (Vicar) on Jun 09, 2003 at 12:31 UTC

You count the memory requirement of an uninitialized variable. Besides that, the guy asked about source code reduction, not computer memory.


Re: Re: Re: pack unpack charcount repetition

by BrowserUk (Patriarch) on Jun 09, 2003 at 12:51 UTC

Sorry! I let my stupid sense of humour get the better of me. I've updated the post to give a sensible answer also.

Examine what is said, not who speaks.


Re: Re: Re: Re: pack unpack charcount repetition

by zby (Vicar) on Jun 09, 2003 at 13:01 UTC

Re: Re: pack unpack charcount repetition

by denthijs (Acolyte) on Jun 09, 2003 at 18:53 UTC


Re: pack unpack charcount repetition
by waswas-fng (Curate) on Jun 09, 2003 at 11:21 UTC

$number[0]

$number[1]

compress


Re: Re: pack unpack charcount repetition

by BrowserUk (Patriarch) on Jun 09, 2003 at 11:44 UTC

How can you fit 0 .. 128 into 4-bits?

Examine what is said, not who speaks.


Re: Re: pack unpack charcount repetition

by meredith (Friar) on Jun 09, 2003 at 12:37 UTC

bit position	place value
1	1
2	2
3	4
4	8
5	16
6	32
7	64
8	128

freeze

thaw

Storable

Update:

mhoward - at - hattmoward.org


Re: Re: pack unpack charcount repetition

by denthijs (Acolyte) on Jun 09, 2003 at 11:52 UTC

> Seeing as all are below 128, you could fit two numbers per byte and save 50% space off the top. The first four bits would be $number[0], the next four bits would be $number[1], etc.

to expand just parse the bits and form the array

just parse the bits and form the array,....


Re: Re: Re: pack unpack charcount repetition

by cciulla (Friar) on Jun 09, 2003 at 12:10 UTC

It ain't gonna happen. The maximum value you can store in four bits (a nibble) is decimal 15.
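The arithmetic is easy to check. A tiny sketch, with max_for_bits and bits_needed being names invented for this example:

```perl
use strict;
use warnings;

# Largest unsigned value that fits in $n bits.
sub max_for_bits { my ($n) = @_; return 2**$n - 1 }

# Number of bits needed to hold a non-negative integer.
sub bits_needed { my ($v) = @_; return length( sprintf '%b', $v ) }

printf "4 bits hold 0..%d\n", max_for_bits(4);
printf "%d needs %d bits\n", $_, bits_needed($_) for 15, 54, 126;
```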

Unless, of course, waswas-fng has access to a computer that uses a quantum bit that has more states than "off" and "on".

C²


Re: Re: pack unpack charcount repetition

by waswas-fng (Curate) on Jun 09, 2003 at 23:00 UTC

(0,28,48,54,60,62,76,126) = (0000,1000,0100,1100,0010,1010,1110,0001)

[0000,1100,1000,1110,0001,0000,1000,1100,1010,1100,0000,1000,1100,0010,0100,0000,1100,1100,1010,1100,0000]

000011001000111000010000100011001010110000001000110000100100000011001100101011000000

0,54,28,76,126,0,28,54,62,54,0,28,54,60,48,0,54,54,62,54,0
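The steps above appear to be: map each value to a 4-bit code, concatenate the codes, and reverse the process to get the list back. Here is a sketch of that reading. The code table is copied from the post, but the pack 'B*' step and the variable names are assumptions of this example.

```perl
use strict;
use warnings;

my @data = ( 0,54,28,76,126,0,28,54,62,54,0,28,54,60,48,0,54,54,62,54,0 );

# Code table copied from the post: value => 4-bit code.
my %code;
@code{ 0,28,48,54,60,62,76,126 } = qw( 0000 1000 0100 1100 0010 1010 1110 0001 );
my %value = reverse %code;

# Encode: concatenate the codes, then pack 8 bits per byte.
my $bits   = join '', map { $code{$_} } @data;
my $packed = pack 'B*', $bits;      # 84 bits, padded to 11 bytes

# Decode: unpack to a bit string, keep the 84 data bits, split into nibbles.
my @back = map { $value{$_} } substr( unpack( 'B*', $packed ), 0, 4 * @data ) =~ /(.{4})/g;

print length($packed), " bytes\n";
print join( ',', @back ), "\n";
```

The packed string is 11 bytes. The 8-value table still has to be stored somewhere for the decoder to work.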


Re: Re: Re: pack unpack charcount repetition

by denthijs (Acolyte) on Jun 10, 2003 at 09:23 UTC

yaph

248131

Yet Another Perl Hacker


RLE for simple array compression
by wufnik (Friar) on Jun 09, 2003 at 17:02 UTC

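use strict;
use warnings;

my @test = (23,23,4,8,21,90,90,90,90,2,2,2,19,21,19);

# Collect [value, run length] pairs. Runs are numbered from 1, so
# $runlengths[0] is left unused.
my (@runlengths, $last);
my $run = 0;
for my $value (@test) {
    if (defined $last && $value == $last) {
        $runlengths[$run][1]++;             # extend the current run
    }
    else {
        $runlengths[++$run] = [$value, 1];  # start a new run
    }
    $last = $value;
}

# Render each run as "VALUExCOUNT", e.g. "90x4".
my @strings = map {
    $runlengths[$_][0] . "x" . $runlengths[$_][1] } ( 1 .. $#runlengths );

# Return the element at 0-based $index of the original array by walking the runs.
sub extract {
    my $index = shift;
    my $end = 0;                            # number of elements covered so far
    for my $r (@runlengths[1 .. $#runlengths]) {
        $end += $r->[1];
        return $r->[0] if $index < $end;
    }
    return;                                 # index is past the end
}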
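For comparison, a self-contained encode/decode pair in the same spirit. This is a sketch rather than the code above: rle_encode and rle_decode are names invented for this example, and runs are numbered from 0 here.

```perl
use strict;
use warnings;

# Encode a list as [value, count] pairs.
sub rle_encode {
    my @runs;
    for my $value (@_) {
        if ( @runs && $runs[-1][0] == $value ) {
            $runs[-1][1]++;              # same value again: extend the run
        }
        else {
            push @runs, [ $value, 1 ];   # new value: start a run
        }
    }
    return @runs;
}

# Expand the pairs back into the original list.
sub rle_decode {
    return map { ( $_->[0] ) x $_->[1] } @_;
}

my @test = (23,23,4,8,21,90,90,90,90,2,2,2,19,21,19);
my @runs = rle_encode(@test);
print join( ' ', map { $_->[0] . 'x' . $_->[1] } @runs ), "\n";

my @back = rle_decode(@runs);
print join( ',', @back ), "\n";
```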