denthijs has asked for the wisdom of the Perl Monks concerning the following question:

in my search for wizzdom i found myself struggling to grasp the concepts of "compression" of a particular "data" element.
(minime was mimicing quotes alongside with me )

In my code i have a list of numbers
(0,54,28,76,126,0,28,54,62,54,0,28,54,60,48,0,54,54,62,54,0)

seeing all those reaccurences i figured, lets reference them, surely that'll save me some chars
(0,28,48,54,60,62,76,126)[0,3,1,6,7,0,1,3,5,3,0,1,3,4,2,0,3,3,5,3,0]
stunned i was of my own incompetence, fool i thought, you thought i'd be so easy

after consulting various oracles and scrolls,
either too advanced or too off topic,
i finally lay my head upon your feat and ask for your guidance

how to make a shorter representation of that list?

any links regarding basic concepts of compression that could help me in these kind of situations would also be great!

cheers all!

update (broquaint): dropped (golf) prefix from title

Replies are listed 'Best First'.
Re: pack unpack charcount repetition
by BrowserUk (Patriarch) on Jun 09, 2003 at 11:34 UTC

    Update:Removed errent print from code as pointed out by cciulla below.

    Also, this was intended as slightly tongue in cheek answer in response to the (golf) nomenclature in the (original) title. I apologise for omiting the smiley. See my addendom below by way of recompemse.

    </update>

    How about 10:1 compression?

    use Devel::Size qw[total_size]; my @a = ( 0,54,28,76,126,0,28,54,62,54,0,28,54,60,48,0,54,54,62,54,0 ) +; my $packed = pack 'C*', @a; print 'Memory requirement of ', total_size(\@a) , ' bytes is reduced to ', total_size(\$packed), ' bytes'; Memory requirement of 476 bytes is reduced to 46 bytes.

    Addendum

    There are two basic was of compressing character data.

    Bitwise reduction (Theres probably a better term). In this you reduce the storage requirement by using less than 8-bit per character. For example. If you only needed to represent uppercase alpha then you could get away with 5-bits/char so you could get a 3/8 ths reduction by packing them into a bitstream

    But as you have 21 bytes and a range of values 0-128, you would at best be saving 1-bit per byte. 21-bits saves 2-bytes! Hardly worth the effort.

    Then there is the dictionary method that you tried yourself. In this, you build a dictionary of the common bytes (actually strings of common byte sequences work better, but your sample data is too short and varied for this to work well), and then represent the bytes by indexes into the dictionary. The problem as you saw is that representing 1 byte by another, Plus the dictionary, makes it worse rather than better. However, if you then use the first technique to reduce the storage requirement of the indices, then you get somewhere.

    Your dictionary has 8 entries [0,28,48,54,60,62,76,126 ] (which is unfortunate. If it where only 7, the compression would be greater). That means you need 4 bits/byte for the indices.

    Update 2 I was having a bad day. 8 values can be indexed with 3 bits! So 21*3/8 gives 8 bytes not 11, so a total of 16 bytes is all that is required. In addition 1-bit per byte if the dictionary could be shed saving another byte.

    my @ind = (0,3,1,6,7,0,1,3,5,3,0,1,3,4,2,0,3,3,5,3,0); my $ind = ''; vec( $ind, $_, 4) = $ind[$_] for 0 .. $#ind; print length $ind; #print 11

    This has allowed you to pack the 21 indicies into 11 bytes. But now you need to concatenate that with the dictionary of 8 bytes, and you are back to 19 bytes!

    If your data contains any common sequences of bytes then you can store multi-byte sequences in the dictionary and represent them with a single index and possibly get a greater saving. A cursory inspection show two such sequences, of two repetitions each 0, 54 and 54, 0 both appear twice. which mean that you could reduce the number of indices by 2, saving 1 byte. But the dictionary would have to grow by 4 bytes to do it. So that doesn't help much either.

    If your range of chars was less, or the there were more data and a greater chance common sequences, then you might get better results. Unfortunately, your dataset is such that it could almost have been purposely chosen to be uncompressable:)


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


      ++BrowserUK.

      However, the print statement in

      my $packed = print pack 'C*', @a;
      may cause some problems when you try to turn the sausage back into the pig.

      my $packed = pack 'C*', @a;
      might get you further if you need to unpack it. ;)

      my @a = ( 0,54,28,76,126,0,28,54,62,54,0,28,54,60,48,0,54,54,62,54,0 ) +; my $packed = pack 'C*', @a; my (@unpacked) = unpack 'C*', $packed; print "$packed\n"; print join ',', @unpacked;

      yes!
      thankyou :)
      finally starting to grasp pack
      this saves a lot of chars!
      (0,54,28,76,126,0,28,54,62,54,0,28,54,60,48,0,54,54,62,54,0); unpack'C*',unpack'u*',"5`#8<3'X`'#8^-@`<-CPP`#8V/C8`";
      the uuencode is for .signature usage
      thnx++
      You count the memory requirement of an uninitialized variable. Beside that, the guy asked about source code reduction - not computer memory.

        Sorry! I let my stupid sense of humour get the better of me. I've updated the post to give a sensible answer also.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


      wholy feaces, thnx! :)
      EveryBody++

      i never really knew what i was trying to do untill i read all this :)
      keep it commin ;)
Re: pack unpack charcount repetition
by waswas-fng (Curate) on Jun 09, 2003 at 11:21 UTC
    Seeing as all are below 128, you could fit two numbers per byte and save 50% space off the top. the first four bitxs would be $number[0] next four bits would be $number[1] etc. to expand just parse the bits and form the array. if you are looking for a more virsitile compression alg, then check out the sources for gzip or bzip2. If you just need to compress a dataset in your program then check out your options at compress each type performs differently with different datasets. I think bzip2 is still a leader overall.

    -Waswas

      How can you fit 0 .. 128 into 4-bits?


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


      Although this has been said, it hasn't been explained:
      The difference between using a field with a max of 128 and a max of 256 is only one bit, not 4. For those not yet versed in binary, that is to say: for 87654321
      bit positionplace value
      11
      22
      34
      48
      516
      632
      764
      8128

      Now, shortening it to 7 bits would give you a 1/8 compression at the expense of not allowing any data that uses the 8th bit. I regard this as a mistake, because it can cause all kinds of problems, re: SMTP, AIX /dev/tty*

      As an aside, this data set will probably be too short to get a decent compression because of different sorts of overhead. Maybe this is a creative use for freeze/thaw??? (that's from Storable).

      Update: 2 Mins later: Ah, I am aware that the OP wanted code reduction, not compression. Well, ignore the last of my ramblings.

      mhoward - at - hattmoward.org
      >Seeing as all are below 128, you could fit two numbers per byte and save 50% space off the top. the first four bitxs would be $number[0] next four bits would be $number[1] etc.
      to expand just parse the bits and form the array
      this sounds like a great plan, ..but
      just parse the bits and form the array,....
      is just sitting there staring at me, could you show me?

        It ain't gonna happen. The maximum value you can store in four bits (a nibble) is decimal 15.

        Unless, of course, waswas-fng has access to a computer that uses a quantum bit that has more states than "off" and "on".

      I am sorry, I posted this at 5 am or so my time and had not had any redbull yet =) here is what i meant to say: If (0,28,48,54,60,62,76,126) = (0000,1000,0100,1100,0010,1010,1110,0001) then you can encode the original array as:
      [0000,1100,1000,1110,0001,0000,1000,1100,1010,1100,0000,1000,1100,0010 +,0100,0000,1100,1100,1010,1100,0000]
      or as bits:
      0000110010001110000100001000110010101100000010001100001001000000110011 +00101011000000
      then convert to 12 bytes (filling the last 4 bits with 1111 to tag padding). to get the data back you need to chomp 4 bits per array slot until done or you hit 1111
      0,54,28,76,126,0,28,54,62,54,0,28,54,60,48,0,54,54,62,54,0
      stored in a file is 58 bytes. As long as the set of unique numbers stays fixed or less than 15 you can save space storing in this way and the larger the array the more savings you will see.

      it was really silly anyways because you will note for such a small set the perl code to compress/decompress + the data compressed is larger than the starting data =)

      -Waswas
        my dataset really was chosen awefully it seems, ..
        still, this could be very handy if i werent to encode yaph(248131) but the full Yet Another Perl Hacker in the future.
        or countless other things, i'm sure ;)
        *less is more* :)
RLE for simple array compression
by wufnik (Friar) on Jun 09, 2003 at 17:02 UTC
    hola;

    from what i gather i think this is a case for good ol' Run Length Encoding (RLE):

    here is RLE for an array in a couple of lines of not very strict code.

    the basic idea is to look for repeated runs on a single symbol in your array, and replace by a single instance of the symbol and a run count. great if your set of symbols (numbers, in this case) is small.

    here goes. result stored in @runlengths as refs.

    @test = (23,23,4,8,21,90,90,90,90,2,2,2,19,21,19); map { $length = ($test[$_ - 1] == $last)? $length + 1: 1; $run++ unless $test[$_ - 1] == $last; $last = $test[$_ - 1]; $runlengths[$run] = [$test[$_ - 1], $length]; } (1 .. scalar @test);
    ok, so what does it look like?

    using
    @strings = map { $runlengths[$_][0] . "x" . $runlengths[$_][1] } ( 1 .. $#runlengths);
    gives us the set of strings

    ("23x2","4x1","8x1","21x1" ... "19x1")

    and i guess no compresssion routine is complete without an extraction function, which i have not optimized much here...
    sub extract{ my $index = shift; my ($last, $lastindex); foreach (@runlengths[1 .. $#runlengths]){ ($last, $lastindex) = ($$_[0], $lastindex + $$_[1]); return $last if $index <= $lastindex - 1; } return undef; }
    a final note: no discussion of compression and perl is complete without reference to the Mark Jason-Dominus article on Huffman encoding, which would grace even a royal toilet:

    http://perl.plover.com/Huffman/huffman.html

    that is meant as sincere flattery btw.

    hope that helps

    ...wufnik

    -- in the world of the mules there are no rules --