baxy77bax:

The amount of compression you can obtain is directly related to the amount of information in the data. The more constraints the data has, the less information it contains, and so the more you can compress it.

You've mentioned that you have an alphabet of three characters (A, B, C) and a random distribution. If you mean a uniform random distribution (i.e., all letters having a probability of 1/3), then each character in your string takes 1.585 bits to encode[1]. But your sample data shows that of 189 characters, A occurs 93 times, B occurs 44 times and C occurs 52 times. If that represents the actual distribution of characters, then you need only 1.503 bits to encode a character, allowing you to squeeze a bit more out of your 10K strings.
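To make the arithmetic above concrete, here is a minimal sketch in Python (not code from this thread). It computes the Shannon entropy from the symbol counts quoted above, and shows one way to get close to the uniform-case bound by treating a string as a base-3 number. The function names `entropy_bits`, `pack_base3` and `unpack_base3` are made up for illustration, and the base-3 packing is only an assumption about how one might use the bound, not a method anyone here has proposed.

```python
import math


def entropy_bits(counts):
    """Shannon entropy, in bits per symbol, for a mapping of symbol -> count."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total)
                for n in counts.values() if n > 0)


def pack_base3(s, alphabet="ABC"):
    """Pack a string over a 3-letter alphabet into bytes as one base-3 integer."""
    value = 0
    for ch in s:
        value = value * 3 + alphabet.index(ch)
    # Smallest whole number of bytes that can hold any string of this length.
    nbytes = ((3 ** len(s) - 1).bit_length() + 7) // 8
    return value.to_bytes(nbytes, "big")


def unpack_base3(data, length, alphabet="ABC"):
    """Inverse of pack_base3; the original string length must be known."""
    value = int.from_bytes(data, "big")
    chars = []
    for _ in range(length):
        value, digit = divmod(value, 3)
        chars.append(alphabet[digit])
    return "".join(reversed(chars))


if __name__ == "__main__":
    # Uniform case: every letter has probability 1/3.
    print(entropy_bits({"A": 1, "B": 1, "C": 1}))
    # Counts taken from the sample data mentioned above.
    print(entropy_bits({"A": 93, "B": 44, "C": 52}))
    # A made-up 63-character string, just to exercise the packing.
    sample = "ABC" * 21
    packed = pack_base3(sample)
    print(len(packed), unpack_base3(packed, len(sample)) == sample)
```

The base-3 packing only reaches the uniform bound (a 63-character string fits in 13 bytes). Getting down toward the skewed-distribution figure would need an entropy coder such as Huffman or arithmetic coding, which is beyond this sketch.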

If you're able to learn more constraints on the data (i.e., rules that valid data must follow), then you can squeeze the data even more. So if you want people to help you compress the data even further, we need to know more about it. Note that *any* rules about the data can be relevant, so don't hold back! As an example, if each line was "similar" to the previous line by some simple rule, it may give us the information needed to

For instance, a couple of years ago you asked a similar question, and by looking over the code used to generate the data, we were able to determine that we could encode your 90-character strings in 51 bits, giving about 0.567 bits per character.

NOTES:

[1]: I used Shannon's Entropy formula to compute the number of bits per symbol.

EDIT: Fixed the wikipedia link as noted by fellow monks AnomalousMonk and LanX.

...roboticus

When your only tool is a hammer, all problems look like your thumb.


In reply to Re: How to efficently pack a string of 63 characters by roboticus
in thread How to efficently pack a string of 63 characters by baxy77bax
