in reply to How to efficently pack a string of 63 characters

baxy77bax:

The amount of compression you can obtain is directly related to the amount of information in the data. The more constraints the data has, the less information it contains, and so the more you can compress it.

You've mentioned that you have an alphabet of three characters (A, B, C) and a random distribution. If you mean a uniform random distribution (i.e., all letters having a probability of 1/3), then each character in your string takes 1.585 bits to encode[1]. But your sample data shows that of 189 characters, A occurs 93 times, B occurs 44 times and C occurs 52 times. If that represents the actual distribution of characters, then you need only 1.503 bits to encode a character, allowing you to squeeze a bit more out of your 10K strings.

If you're able learn more constraints on the data (i.e., rules that valid data must follow), then you can squeeze the data even more. So if you want people to help you compress the data even further, we need to know more about your data if you want to get better compression. Note that *any* rules about the data can be relevant: so don't hold back! As an example, if each line was "similar" to the previous line by some simple rule, it may give us the information needed to

As an example, a couple years ago, you asked a similar question, and by looking over the code used to generate the data, we were able to determine that we could encode your 90 character strings in 51 bits, giving about 0.567 bits per character.

NOTES:

[1]: I used Shannon's Entropy formula to compute the number of bits per symbol.

EDIT: Fixed the wikipedia link as noted by fellow monks AnomalousMonk and LanX.

...roboticus

When your only tool is a hammer, all problems look like your thumb.

  • Comment on Re: How to efficently pack a string of 63 characters

Replies are listed 'Best First'.
Re^2: How to efficently pack a string of 63 characters
by AnomalousMonk (Archbishop) on Sep 09, 2021 at 19:36 UTC
Re^2: How to efficently pack a string of 63 characters
by baxy77bax (Deacon) on Sep 09, 2021 at 21:05 UTC
    To tell the truth i totally forgot about that post. Yes, back then I needed it for something completely different and totally did not connect the two problems. I should have been more aware of what I was posting and not waste time. I do apologize for that :(

    PS

    When I say random i always assume sampling from a uniform dist. (but yes, more information and rules i have, more compression I am going gain)

    Thnx

      > When I say random i always assume sampling from a uniform dist.

      The shown sample is obviously far from random or uniformly distributed

      > > > Example:

      ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC ...

      Please show us some real input. And more of it.

      Otherwise, if that's real input, please realize it's very redundant.

      Otherwise couldn't compress such a distribution to 5% already.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery