baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have a short question; basically I'm looking for some inspiration. What I have is a set of strings that I intend to compress using lzt. But before I do that, I was wondering whether I could improve the compression by pre-encoding them. What comes to mind is Huffman and arithmetic coding, but for those I would need to preprocess the entire set (or at least a section of it), which for my application is not the most optimal way to go. As an alternative I could go with run-length encoding (and smart delta coding, maybe?), and this is where my imagination stops.

About the problem:

The string set usually comes in batches of 10000 (|S| = 10000). The alphabet has 3 symbols (|A| = 3, A = {A, B, C}). Each string is exactly 63 characters long (|s_1| = |s_2| = ... = |s_|S|| = 63).

Example:

ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
...
One can assume a random distribution of characters from Alphabet A = {A, B, C}
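
For reference, under that uniform assumption the information-theoretic floor per string is easy to work out:

# Entropy floor for 63 uniform characters over a 3-letter alphabet.
printf "%.2f bits = %.2f bytes per string\n",
    63 * log(3) / log(2),        # 99.85 bits
    63 * log(3) / log(2) / 8;    # ~12.5 bytes, vs. 63 bytes raw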

Possible solution:

position in a byte:  1 2 | 3 4 5 6 7 8
                     A A | delta between mismatching consecutive characters

The first two bits can be used to store the alphabet information:

00 A
01 B
11 C

The other 6 bits can be used for RL coding, which will ultimately consume one byte for each mismatching character (i.e., one byte per run). Does anyone have a suggestion for a more optimal solution than this one?
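
For concreteness, here is a minimal sketch of that byte-per-run scheme (using letter codes 0..2 rather than the exact bit patterns above):

# Sketch of the proposed scheme: one byte per run, 2 bits for the
# letter (A=0, B=1, C=2), 6 bits for the run length (1..63).
use strict;
use warnings;

my %code = ( A => 0, B => 1, C => 2 );
my @char = qw( A B C );

sub rle_pack {
    my ($s) = @_;
    my $out = '';
    while ( $s =~ /\G(([ABC])\2{0,62})/g ) {    # runs of up to 63
        $out .= chr( length($1) << 2 | $code{$2} );
    }
    return $out;
}

sub rle_unpack {
    join '', map { $char[ ord($_) & 3 ] x ( ord($_) >> 2 ) } split //, shift;
}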

Thank you :)

baxy

Replies are listed 'Best First'.
Re: How to efficiently pack a string of 63 characters
by GrandFather (Saint) on Sep 08, 2021 at 21:00 UTC

    Essentially, you can't compress compressed data. There are tricks that can help with some data, but probably not with the data you present. Something like what you suggest can work well for digitally sampled analog data, where most of the information is in the low bits, so compressing the high bits of the samples separately can be advantageous. Doing that where all the high bits are the same for the entire data set only bulks out the final result.

    Optimising for fewest keystrokes only makes sense when transmitting to Pluto or beyond
Re: How to efficiently pack a string of 63 characters
by roboticus (Chancellor) on Sep 09, 2021 at 14:11 UTC

    baxy77bax:

    The amount of compression you can obtain is directly related to the amount of information in the data. The more constraints the data has, the less information it contains, and so the more you can compress it.

    You've mentioned that you have an alphabet of three characters (A, B, C) and a random distribution. If you mean a uniform random distribution (i.e., all letters having a probability of 1/3), then each character in your string takes 1.585 bits to encode[1]. But your sample data shows that of 189 characters, A occurs 93 times, B occurs 44 times and C occurs 52 times. If that represents the actual distribution of characters, then you need only 1.503 bits to encode a character, allowing you to squeeze a bit more out of your 10K strings.
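
    A minimal sketch of that calculation, using the counts from the sample above:

# Shannon entropy in bits/symbol: H = -sum( p * log2(p) )
use strict;
use warnings;

my %count = ( A => 93, B => 44, C => 52 );   # counts from the sample
my $total = 0;
$total += $_ for values %count;

my $bits = 0;
for my $n ( values %count ) {
    my $p = $n / $total;
    $bits -= $p * log($p) / log(2);
}
printf "%.3f bits/symbol\n", $bits;          # just over 1.5 bits/symbol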

    If you're able to learn more constraints on the data (i.e., rules that valid data must follow), then you can squeeze the data even more. So if you want people to help you compress the data further, we need to know more about it. Note that *any* rules about the data can be relevant, so don't hold back! As an example, if each line is "similar" to the previous line by some simple rule, that may give us the information needed to compress much further.

    As an example, a couple of years ago you asked a similar question, and by looking over the code used to generate the data, we were able to determine that we could encode your 90-character strings in 51 bits, giving about 0.567 bits per character.

    NOTES:

    [1]: I used Shannon's Entropy formula to compute the number of bits per symbol.

    EDIT: Fixed the wikipedia link as noted by fellow monks AnomalousMonk and LanX.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      To tell the truth, I totally forgot about that post. Yes, back then I needed it for something completely different and totally did not connect the two problems. I should have been more aware of what I was posting and not wasted your time. I do apologize for that :(

      PS

      When I say random, I always assume sampling from a uniform distribution. (But yes, the more information and rules I have, the more compression I am going to gain.)

      Thnx

        > When I say random i always assume sampling from a uniform dist.

        The sample shown is obviously far from random or uniformly distributed.

        > > > Example:

        ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
        BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
        ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
        ...

        Please show us some real input. And more of it.

        If that is real input, please realize that it's very redundant; otherwise such a distribution couldn't already be compressed to 5%.

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery

Re: How to efficiently pack a string of 63 characters
by tybalt89 (Monsignor) on Sep 09, 2021 at 18:47 UTC

    Just for fun, here are a few different ways to do compression. Note that I assume you have a longer file with many lines of 63 chars each. I remove the newlines (they can always be re-inserted later) because some compression algorithms do better with longer strings of data, and the newline would introduce a fourth character to the mix.

    Have fun !

#!/usr/bin/perl
use strict; # https://perlmonks.org/?node_id=11136582
use warnings;

for my $try (
  [ 'gzip/gunzip',                 \&compgzip,  \&uncompgzip ],
  [ '2 bit code, 6 bit runlength', \&comp62,    \&uncomp62 ],
  [ '2 bits per letter',           \&comp2bits, \&uncomp2bits ],
  [ 'groups of 5,2,1',             \&comp5,     \&uncomp5 ],
  )
  {
  my ($method, $comp, $uncomp) = @$try;
  print "\n------------------------------ Compression by $method\n\n";
  my $data = <<END;
ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
END
  $data =~ tr/\n//d;    # remove \n they can be re-inserted later
  print "length of data @{[ $data =~ tr/ABC// ]}\n";
  my $compressed = $comp->($data);
  print "length of compressed data @{[ length $compressed ]}\n";
#use Data::Dump 'dd'; dd $compressed;
#  print unpack('H*', $compressed), "\n";
  my $uncompressed = $uncomp->($compressed);
  printf "compressed to %.1f%%\n",
    100 * length($compressed) / length $uncompressed;
  print $data eq $uncompressed ? "MATCH" : "************ no MATCH", "\n";
  }

# compress by groups of 5,2,1 to single letter

sub comp5
  {
  my @code = map glob('{A,B,C}' x $_), 5, 2, 1;
  my %code;
  @code{@code} = map chr, 1 .. @code;
  local $" = '|';
  shift =~ s/(@code)/$code{$1}/gr
  }

sub uncomp5
  {
  my @code = map glob('{A,B,C}' x $_), 5, 2, 1;
  my %code;
  @code{map chr, 1 .. @code} = @code;
  join '', @code{split //, shift};
  }

# compress by lower two bits of letter

sub comp2bits
  {
  my ($ans, $n) = ('', 0);
  vec($ans, $n++, 2) = 3 & ord $_ for split //, shift;
  $ans;
  }

sub uncomp2bits
  {
  my $comp = shift;
  join '', map { ('', 'A', 'B', 'C')[ vec $comp, $_, 2 ] }
    0 .. -1 + 4 * length $comp;
  }

# compress by runlength or 6 bits length and 2 bits letter code

sub comp62
  {
  shift =~ s/([ABC])\1{0,62}/ chr( length($&) << 2 | ord($1) & 3 ) /ger;
  }

sub uncomp62
  {
  shift =~ s/./ (($& & "\3") | '@') x (ord($&) >> 2) /gesr;
  }

# compress by gzip

use IO::Compress::Gzip qw(gzip);
sub compgzip
  {
  gzip \(shift) => \(my $output);
  $output;
  }

use IO::Uncompress::Gunzip qw(gunzip);
sub uncompgzip
  {
  gunzip \(shift) => \(my $output);
  $output;
  }

    Outputs:

------------------------------ Compression by gzip/gunzip

length of data 189
length of compressed data 77
compressed to 40.7%
MATCH

------------------------------ Compression by 2 bit code, 6 bit runlength

length of data 189
length of compressed data 76
compressed to 40.2%
MATCH

------------------------------ Compression by 2 bits per letter

length of data 189
length of compressed data 48
compressed to 25.4%
MATCH

------------------------------ Compression by groups of 5,2,1

length of data 189
length of compressed data 39
compressed to 20.6%
MATCH

    I tried gzip on the groups 5,2,1 compression string and gzip made it longer :)

      > gzip/gunzip ...compressed to 40.7%

      I'm pretty sure that zip has some fixed overhead which doesn't pay off with just 189 bytes of input.
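
      That overhead is easy to measure; compressing an empty string shows the fixed header-plus-trailer cost:

# gzip's fixed cost: header and trailer are paid even for empty input.
use IO::Compress::Gzip qw(gzip);
gzip \(my $in = '') => \(my $out);
print length($out), " bytes for empty input\n";   # about 20 bytes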

      > my @code = map glob('{A,B,C}'x $_), 5, 2, 1;

      I haven't run your code, but it looks like you are mapping 9 possible chunks to characters that each need a byte.

      Looks like you are wasting space. Already a naive 4-bit-per-chunk approach, i.e. 1 byte for two chunks, would double your efficiency.° (Needless to say, Huffman even more.) (Update: sorry, I totally misunderstood how your glob works.)

      FWIW, I doubt the input sample is realistic. It looks handmade, by copying and altering the first line, hence the high redundancy. :)

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        I think you might be misreading the glob.

        The 5,2,1 maps 5 letters to one byte (look at the mapping hash with Data::Dump), so I'm getting a 5 to 1 reduction. (Ignoring the 2,1 which is just there for strings whose length is not a multiple of 5.)

        I don't understand your use of 'chunks', or where you get 9 of them.

        BTW: It's a 5 to 1 reduction independent of the redundancy in the string. Other compressors may use redundancy to do better. It's sort of a question of how random the letters really are.
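
        To see why five letters fit in a byte: there are only 3**5 = 243 possible 5-letter chunks, fewer than the 256 values a byte can hold.

# Quick check: all 3**5 = 243 five-letter chunks over {A,B,C}
# fit in single-byte codes.
my @code = glob( '{A,B,C}' x 5 );
print scalar @code, "\n";   # 243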

        > I'm pretty sure that zip has some fix overhead which doesn't pay off with just 189 bytes input.

        To prove my point, here is an altered version of your code which fakes longer data by rotating the original input, and it shows zip at 5% compression. That's a factor of 4 better than your champion.

        (Of course rotating is kind of biased, because it keeps most run-length chunks intact and zip will efficiently Huffman-code all the runs it finds.° But it's up to the OP to provide unbiased data, I'm no psychic... ;-)

        FWIW: zip is already at 18% after only tripling the input.

use strict;        # https://perlmonks.org/?node_id=11136582
use warnings;      # originally https://perlmonks.org/?displaytype=displaycode;node_id=11136614;abspart=1;part=1
use feature 'say';

for my $try (
  [ 'gzip/gunzip',                 \&compgzip,  \&uncompgzip ],
  [ '2 bit code, 6 bit runlength', \&comp62,    \&uncomp62 ],
  [ '2 bits per letter',           \&comp2bits, \&uncomp2bits ],
  [ 'groups of 5,2,1',             \&comp5,     \&uncomp5 ],
  )
  {
  my ($method, $comp, $uncomp) = @$try;
  print "\n------------------------------ Compression by $method\n\n";
  my $data = <<END;
ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
END

  my @data = split /\n/, $data;
  push @data, map { rotate($data[$_]) } 0 .. 2 for 1 .. 100;   # fake x100 times data
  $data = join "", @data;    # remove \n they can be re-inserted later

  print "length of data @{[ $data =~ tr/ABC// ]}\n";
  my $compressed = $comp->($data);
  print "length of compressed data @{[ length $compressed ]}\n";
#use Data::Dump 'dd'; dd $compressed;
#  print unpack('H*', $compressed), "\n";
  my $uncompressed = $uncomp->($compressed);
  printf "compressed to %.1f%%\n",
    100 * length($compressed) / length $uncompressed;
  print $data eq $uncompressed ? "MATCH" : "************ no MATCH", "\n";
  }

# by lanx
sub rotate {    # fake data from original sample
  # say "O: ",
  my $orig = shift;
  # say
  my $rnd = int rand 63;
  my $head = substr $orig, 0, $rnd, "";
  # say "N: ",
  my $new = $orig . $head;
  return $new;
}

# compress by groups of 5,2,1 to single letter

sub comp5
  {
  my @code = map glob('{A,B,C}' x $_), 5, 2, 1;
  my %code;
  @code{@code} = map chr, 1 .. @code;
  local $" = '|';
  shift =~ s/(@code)/$code{$1}/gr
  }

sub uncomp5
  {
  my @code = map glob('{A,B,C}' x $_), 5, 2, 1;
  my %code;
  @code{map chr, 1 .. @code} = @code;
  join '', @code{split //, shift};
  }

# compress by lower two bits of letter

sub comp2bits
  {
  my ($ans, $n) = ('', 0);
  vec($ans, $n++, 2) = 3 & ord $_ for split //, shift;
  $ans;
  }

sub uncomp2bits
  {
  my $comp = shift;
  join '', map { ('', 'A', 'B', 'C')[ vec $comp, $_, 2 ] }
    0 .. -1 + 4 * length $comp;
  }

# compress by runlength or 6 bits length and 2 bits letter code

sub comp62
  {
  shift =~ s/([ABC])\1{0,62}/ chr( length($&) << 2 | ord($1) & 3 ) /ger;
  }

sub uncomp62
  {
  shift =~ s/./ (($& & "\3") | '@') x (ord($&) >> 2) /gesr;
  }

# compress by gzip

use IO::Compress::Gzip qw(gzip);
sub compgzip
  {
  gzip \(shift) => \(my $output);
  $output;
  }

use IO::Uncompress::Gunzip qw(gunzip);
sub uncompgzip
  {
  gunzip \(shift) => \(my $output);
  $output;
  }

-*- mode: compilation; default-directory: "d:/tmp/pm/" -*-
Compilation started at Fri Sep 10 13:25:18

C:/Strawberry/perl/bin\perl.exe -w d:/tmp/pm/pack_63_chars.pl

------------------------------ Compression by gzip/gunzip

length of data 19089
length of compressed data 1021
compressed to 5.3%
MATCH

------------------------------ Compression by 2 bit code, 6 bit runlength

length of data 19089
length of compressed data 7714
compressed to 40.4%
MATCH

------------------------------ Compression by 2 bits per letter

length of data 19089
length of compressed data 4773
compressed to 25.0%
MATCH

------------------------------ Compression by groups of 5,2,1

length of data 19089
length of compressed data 3819
compressed to 20.0%
MATCH

Compilation finished at Fri Sep 10 13:25:18

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery

        °) Even after constantly reversing one part of the input, I'm at 9% compression for the factor-100 input.

Re: How to efficiently pack a string of 63 characters
by jwkrahn (Abbot) on Sep 09, 2021 at 01:51 UTC

    You could probably do something like this:

$ echo "ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC" | perl -le'
use Data::Dumper;
$Data::Dumper::Useqq = 1;
my %alphabet = qw/ A 00 B 01 C 11 /;
my %rev_alpha = reverse %alphabet;
while ( <> ) {
    chomp;
    print length(), " ", $_;
    # compress
    my $data = pack "B*", join "", map $alphabet{ $_ }, split //;
    print length $data, " ", Dumper $data;
    # decompress
    my $uncp = join "", map $rev_alpha{ $_ }, map /../g, unpack "B*", $data;
    print length $uncp, " ", $uncp, "\n";
}
'
63 ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
16 $VAR1 = "\27p\1w?\0000\1U\0\24\0\1\177<\34";
64 ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABCA

63 BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
16 $VAR1 = "\177p\301\\\17\0000\1U\0\24\0\1\177<\\";
64 BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBCA

63 ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
16 $VAR1 = "\37T\1G1\377\360\1u\377\0\0\0?<\374";
64 ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCCA

    One of the problems with this is that "00" is a valid value, so you get extra "A"s padding the right-hand side of the string on decompression.

    If you change the alphabet to:

    01 A
    10 B
    11 C

    It will decompress properly:

$ echo "ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC" | perl -le'
use Data::Dumper;
$Data::Dumper::Useqq = 1;
my %alphabet = qw/ A 01 B 10 C 11 /;
my %rev_alpha = reverse %alphabet;
while ( <> ) {
    chomp;
    print length(), " ", $_;
    # compress
    my $data = pack "B*", join "", map $alphabet{ $_ }, split //;
    print length $data, " ", Dumper $data;
    # decompress
    my $uncp = join "", map $rev_alpha{ $_ }, map /../g, unpack "B*", $data;
    print length $uncp, " ", $uncp, "\n";
}
'
63 ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
16 $VAR1 = "k\265V\273\177UuV\252UiUV\277}l";
63 ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC

63 BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
16 $VAR1 = "\277\265\326\255_UuV\252UiUV\277}\254";
63 BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC

63 ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
16 $VAR1 = "o\251V\233v\377\365V\272\377UUU\177}\374";
63 ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
Re: How to efficiently pack a string of 63 characters
by bliako (Abbot) on Sep 09, 2021 at 06:51 UTC
    > One can assume a random distribution of characters from Alphabet A = {A, B, C}

    Uniformly random? If they follow another distribution (e.g. if A is more likely than B) you can get an advantage.

    Investigating the distribution of A, B, C can be insightful in deciding whether 6 bits for RL coding is too much. I can see that you get long sequences like AAAAAAAAAAAAA which can utilise a lot of RL bits, but what percentage of the RL bits goes unused? And for which characters?
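
    Here is a quick sketch of such an investigation, tabulating run lengths per character from lines on standard input:

# Sketch: count run lengths per character to see how much of the
# 6-bit run-length field would actually be used.
use strict;
use warnings;

my %runs;
while ( my $line = <> ) {
    chomp $line;
    $runs{$2}{ length $1 }++ while $line =~ /(([ABC])\2*)/g;
}
for my $c ( sort keys %runs ) {
    print "$c: run of $_ x $runs{$c}{$_}\n"
        for sort { $a <=> $b } keys %{ $runs{$c} };
}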

    3^63 = 1144561273430837494885949696427. Are all these combinations equally likely? Or even possible? If not, you can get an advantage.

    Your data is similar to gene sequences, for which compression is a well-studied problem. See Compression_of_genomic_sequencing_data for ideas.

    bw, bliako

Re: How to efficiently pack a string of 63 characters
by LanX (Saint) on Sep 08, 2021 at 20:48 UTC
    If the sample is real and representative, then it looks like you have lots of similarity between consecutive lines, akin to image data.

    I'd try:

    • calculate the deltas between consecutive lines, like C - A = 2
    • run-length encode those deltas, like (2,5), (0,20), ...
    • probably try Huffman-coding the run-length pairs

    You can skip step 3 and just zip the result of step 2, because Huffman coding is part of zip IIRC.
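
    A sketch of steps 1 and 2, assuming the deltas are taken per position, modulo 3:

# Sketch: per-position deltas (mod 3) between consecutive lines,
# then run-length encoded as (delta, count) pairs.
use strict;
use warnings;

my %val = ( A => 0, B => 1, C => 2 );

sub delta_line {    # step 1: C - A = 2, etc., per position
    my ( $prev, $cur ) = @_;
    join '', map {
        ( $val{ substr $cur, $_, 1 } - $val{ substr $prev, $_, 1 } ) % 3
    } 0 .. length($cur) - 1;
}

sub rle_pairs {     # step 2: "22222000..." -> (2,5), (0,3), ...
    my ($d) = @_;
    my @pairs;
    push @pairs, [ $2, length $1 ] while $d =~ /((.)\2*)/g;
    return @pairs;
}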

    NB: feeding binary data into zip might be counterproductive. Better to feed it "readable" text where the patterns are visible.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      > calculate the deltas between two lines like C-A=2

      OK, I tried this and fed the result into IO::Compress::Gzip; in all scenarios it yielded at most a one percent gain.

      I'd say gzip is the best choice here; it compressed to between 4% and 24% (worst case) without any prior knowledge, and 19.81% is the optimum for the worst case of total entropy.
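
      (Where the 19.81% comes from: one character per byte is 8 bits, while a uniform 3-letter alphabet carries log2(3) bits per character.)

# Best possible ratio for uniform {A,B,C} stored one character per byte.
printf "%.2f%%\n", 100 * log(3) / log(2) / 8;   # 19.81%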

      I also tried providing extra flags for best compression and a minimal header, but the benefit was again only 1% at most.

      And this module is in core.

      > corelist IO::Compress::Gzip

      Data for 2021-01-23
      IO::Compress::Gzip was first released with perl v5.9.4

      I think the decision is a no-brainer ...

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        In my testing I also tried IO::Compress::Bzip2, which was always slightly better than Gzip and is also in core.

        corelist IO::Compress::Bzip2

        Data for 2021-05-20
        IO::Compress::Bzip2 was first released with perl v5.10.1
Re: How to efficiently pack a string of 63 characters
by Discipulus (Canon) on Sep 09, 2021 at 12:12 UTC
    Hello baxy77bax,

    sorry for my naive suggestion, in fact this is not at all my field (well.. what is my field? ;) but what about something like:

use strict;
use warnings;

my @str = qw(
    ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
    BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
    ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
);

foreach my $str ( @str ){
    print "$str\n",
          $str =~ s/([A-C])(\1{2,})/$1.(1+length($2))/reg,
          "\n\n";
}

__END__
ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
ABBCBCA5BBCBCAC3A5CA5B5A5BBA8BBC3ACCAABC

BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
BC3BCAACAAB3CA3CCA5CA5B5A5BBA8BBC3ACCABBC

ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
ABCCB3A4BBABCACABC6A5BBCBBC4A13C3ACCAC3

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
      Yes, but run-length encoding (+ Huffman) is already part of zip.

      See http://www.zlib.net/feldspar.html

      The only chance to improve on zip is to compress redundancy that zip doesn't find by itself, while still providing character-based input (which you are doing).

      That's why I suggested a line-delta compression first in my other answer, provided the example is real (which is not a given with this OP).

      See Re: How to efficiently pack a string of 63 characters

      (There is a chance, though, that zip is clever enough to notice such repeating groups too. But that kind of stuff is normally part of image compression, not text compression.)

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

Re: How to efficiently pack a string of 63 characters
by Anonymous Monk on Sep 08, 2021 at 20:58 UTC
    Ironically, you might reduce the efficiency of the compression, because you will have squeezed out the "redundancy" upon which text-compression algorithms depend. Simply stated, these algorithms recognize "a sequence that we have recently seen before" and replace it with a coded reference to that sequence. Your pre-processing might remove that redundancy, reducing or even eliminating the chance to substitute codes. However, there's really only one way to find out: an empirical test. Generate a few thousand bits of plausibly representative data (unless you have the real stuff at hand), and try it both ways.
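
    A minimal sketch of such a generator, assuming the uniform distribution the OP described:

# Sketch: generate plausibly representative test data -- 10000 lines
# of 63 uniformly random characters from {A,B,C}.
use strict;
use warnings;

my @alpha = qw( A B C );
for ( 1 .. 10000 ) {
    print join( '', map { $alpha[ int rand 3 ] } 1 .. 63 ), "\n";
}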
      True but "hand optimized" might be more efficient than automatically packed, because human heuristic might find unusual patterns.

      We will only be able to tell after the OP provided us with more typical sample data.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

Re: How to efficiently pack a string of 63 characters
by Marshall (Canon) on Sep 10, 2021 at 13:23 UTC
    From your problem statement, you have not just ASCII characters but upper-case ASCII only, not arbitrary binary data. I would go "old school" and use the Baudot_code. This gets each 8-bit byte down to 5 bits. Shifting between upper and lower case can be done, but requires an additional "control" character.

    The advantage is that this is a well-known standard that is still in use today. There is nothing tricky or fancy about it, which means it can be done very efficiently: 8 bits -> 5 bits according to a standard protocol.

    Update: That means that 3 of every 8 bits are not needed; 3/8 means 37.5% compression for almost nothing.
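
    Perl's vec() only handles power-of-two widths, so a 5-bit packing has to go through bit strings. Here is a sketch with a made-up 3-entry table (not the actual Baudot code points):

# Sketch: straight 5-bit packing, Baudot-style. The table is
# hypothetical (A=1, B=2, C=3), not the real Baudot assignments.
use strict;
use warnings;

my %five = ( A => 1, B => 2, C => 3 );
my %back = reverse %five;

sub pack5 {
    pack 'B*', join '', map { sprintf '%05b', $five{$_} } split //, shift;
}

sub unpack5 {
    join '', map { $back{ oct "0b$_" } // '' }
        unpack( 'B*', shift ) =~ /.{5}/g;
}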

    Also, if this is true: "One can assume a random distribution of characters from Alphabet A = {A, B, C}", then there is not much you can do past Baudot code, since random data cannot be compressed. Oh, perhaps I see now: the alphabet will be limited to a set of fewer than 26 characters. In that case, perhaps make your own version of the Baudot code? Perhaps send a block of translation tables before the actual data? It does appear to me that you can achieve a huge amount of compression with a table-driven, very efficient algorithm.