baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

recently i am having some problems with storing some numbers. More precisely I have a range of numbers all between 1 and 100. Example:

1,5,6,12,17,43,56,73,98
what i need to do is to encode this array using at mose 2 bytes. A simple encoding scheme would be to create a bit string and then flip 0's to 1's at the above stated positions. However, a bit-string for this array is appx. 13 bytes which is 11 more then i have at my disposal. Generally there is always a trade-off between memory and speed. This time speed is not extremely important for me. so i was wondering if someone knows and is willing to share any hashing scheme, or something that could help me to squeeze 13 bytes into 2.

thank you

baxy

Replies are listed 'Best First'.
Re: coding question
by Corion (Patriarch) on Jan 04, 2014 at 15:06 UTC

    Maybe you can share a bit about how the numbers between 1 and 100 are distributed, or some more properties of them? In general, it's impossible to encode an arbitrary subset of the numbers 1 to 100 in less than 100 bits.

      yea , after posting the question it kind of hit me too. I recall from information theory that something like this is maybe not possible. some other information that i can give to you is : the numbers are always ordered (from smallest to largers). numbers are uniformly distributed and usually there are about 50 numbers in the array.

      ps is it possible maybe to tahe only first 8 bits for numbers 1-8 and then use the generated bit-string as a key for encoding next 8 bits and then repeate this recursivly ... hm ... i cannot see it ....

Re: coding question
by davido (Cardinal) on Jan 04, 2014 at 18:11 UTC

    Hashing is used to summarize content. There's no guarantee that collisions won't happen if the hashing algorithm creates a hash of fewer bits than the maximum possible entropy in the data set. And for your purposes, you are virtually assured of having collisions. As corion mentioned, you cannot uniquely represent the state of every number between 1 and 100 with fewer than 100 bits (your bit string approach is already an optimal solution if you cannot tolerate information loss). If you use a hashing solution instead, you will have collisions, since 16 bits is an inadequate key size to avoid collisions.

    Can you tolerate collisions? Could you solve the problem that motivated this quest by taking a different route entirely (In other words, could this be an X-Y problem?)?


    Dave

Re: coding question
by oiskuu (Hermit) on Jan 04, 2014 at 19:46 UTC

    If subset size were known, you'd be counting the combinations.

    for my $n (100) { for my $m (1..100) { my $i = 1; $i *= ($n+1-$_)/$_ for 1..$m; printf "C($m,$n) = $i (%.3f bits)\n", log($i)/log(2); } }
    Sadly, representing even three elements requires > 16 bits.