in reply to pack unpack charcount repetition

Update:Removed errent print from code as pointed out by cciulla below.

Also, this was intended as slightly tongue in cheek answer in response to the (golf) nomenclature in the (original) title. I apologise for omiting the smiley. See my addendom below by way of recompemse.

</update>

How about 10:1 compression?

use Devel::Size qw[total_size]; my @a = ( 0,54,28,76,126,0,28,54,62,54,0,28,54,60,48,0,54,54,62,54,0 ) +; my $packed = pack 'C*', @a; print 'Memory requirement of ', total_size(\@a) , ' bytes is reduced to ', total_size(\$packed), ' bytes'; Memory requirement of 476 bytes is reduced to 46 bytes.

Addendum

There are two basic was of compressing character data.

Bitwise reduction (Theres probably a better term). In this you reduce the storage requirement by using less than 8-bit per character. For example. If you only needed to represent uppercase alpha then you could get away with 5-bits/char so you could get a 3/8 ths reduction by packing them into a bitstream

But as you have 21 bytes and a range of values 0-128, you would at best be saving 1-bit per byte. 21-bits saves 2-bytes! Hardly worth the effort.

Then there is the dictionary method that you tried yourself. In this, you build a dictionary of the common bytes (actually strings of common byte sequences work better, but your sample data is too short and varied for this to work well), and then represent the bytes by indexes into the dictionary. The problem as you saw is that representing 1 byte by another, Plus the dictionary, makes it worse rather than better. However, if you then use the first technique to reduce the storage requirement of the indices, then you get somewhere.

Your dictionary has 8 entries [0,28,48,54,60,62,76,126 ] (which is unfortunate. If it where only 7, the compression would be greater). That means you need 4 bits/byte for the indices.

Update 2 I was having a bad day. 8 values can be indexed with 3 bits! So 21*3/8 gives 8 bytes not 11, so a total of 16 bytes is all that is required. In addition 1-bit per byte if the dictionary could be shed saving another byte.

my @ind = (0,3,1,6,7,0,1,3,5,3,0,1,3,4,2,0,3,3,5,3,0); my $ind = ''; vec( $ind, $_, 4) = $ind[$_] for 0 .. $#ind; print length $ind; #print 11

This has allowed you to pack the 21 indicies into 11 bytes. But now you need to concatenate that with the dictionary of 8 bytes, and you are back to 19 bytes!

If your data contains any common sequences of bytes then you can store multi-byte sequences in the dictionary and represent them with a single index and possibly get a greater saving. A cursory inspection show two such sequences, of two repetitions each 0, 54 and 54, 0 both appear twice. which mean that you could reduce the number of indices by 2, saving 1 byte. But the dictionary would have to grow by 4 bytes to do it. So that doesn't help much either.

If your range of chars was less, or the there were more data and a greater chance common sequences, then you might get better results. Unfortunately, your dataset is such that it could almost have been purposely chosen to be uncompressable:)


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


Replies are listed 'Best First'.
Re: Re: pack unpack charcount repetition
by cciulla (Friar) on Jun 09, 2003 at 12:15 UTC

    ++BrowserUK.

    However, the print statement in

    my $packed = print pack 'C*', @a;
    may cause some problems when you try to turn the sausage back into the pig.

    my $packed = pack 'C*', @a;
    might get you further if you need to unpack it. ;)

    my @a = ( 0,54,28,76,126,0,28,54,62,54,0,28,54,60,48,0,54,54,62,54,0 ) +; my $packed = pack 'C*', @a; my (@unpacked) = unpack 'C*', $packed; print "$packed\n"; print join ',', @unpacked;

Re: Re: pack unpack charcount repetition
by denthijs (Acolyte) on Jun 09, 2003 at 12:17 UTC
    yes!
    thankyou :)
    finally starting to grasp pack
    this saves a lot of chars!
    (0,54,28,76,126,0,28,54,62,54,0,28,54,60,48,0,54,54,62,54,0); unpack'C*',unpack'u*',"5`#8<3'X`'#8^-@`<-CPP`#8V/C8`";
    the uuencode is for .signature usage
    thnx++
Re: Re: pack unpack charcount repetition
by zby (Vicar) on Jun 09, 2003 at 12:31 UTC
    You count the memory requirement of an uninitialized variable. Beside that, the guy asked about source code reduction - not computer memory.

      Sorry! I let my stupid sense of humour get the better of me. I've updated the post to give a sensible answer also.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


        Well - actually that could form a valid solution - just use the packed string in the source code. I can't get it right here - but probably someone with more experience with unpack etc can succede with this idea.
Re: Re: pack unpack charcount repetition
by denthijs (Acolyte) on Jun 09, 2003 at 18:53 UTC
    wholy feaces, thnx! :)
    EveryBody++

    i never really knew what i was trying to do untill i read all this :)
    keep it commin ;)