Re: How to efficiently pack a string of 63 characters
by GrandFather (Saint) on Sep 08, 2021 at 21:00 UTC
Essentially, you can't compress compressed data. There are tricks that can help with some data, but probably not with the data you present. Something like what you suggest can work well for digitally sampled analog data, where most of the information is in the low bits, so compressing the high bits of the samples separately can be advantageous. Doing that where all the high bits are the same for the entire data set only bulks out the final result.
Optimising for fewest keystrokes only makes sense when transmitting to Pluto or beyond
Re: How to efficiently pack a string of 63 characters
by roboticus (Chancellor) on Sep 09, 2021 at 14:11 UTC
baxy77bax:
The amount of compression you can obtain is directly related to the amount of information in the data. The more constraints the data has, the less information it contains, and so the more you can compress it.
You've mentioned that you have an alphabet of three characters (A, B, C) and a random distribution. If you mean a uniform random distribution (i.e., all letters having a probability of 1/3), then each character in your string takes 1.585 bits to encode[1]. But your sample data shows that of 189 characters, A occurs 93 times, B occurs 44 times and C occurs 52 times. If that represents the actual distribution of characters, then you need only about 1.505 bits per character, allowing you to squeeze a bit more out of your 10K strings.
If you're able to learn more constraints on the data (i.e., rules that valid data must follow), then you can squeeze the data even more. So if you want people to help you compress the data further, we need to know more about it. Note that *any* rules about the data can be relevant, so don't hold back! As an example, if each line were "similar" to the previous line by some simple rule, that may give us the information needed to exploit the similarity.
As an example, a couple of years ago you asked a similar question, and by looking over the code used to generate the data, we were able to determine that we could encode your 90-character strings in 51 bits, giving about 0.567 bits per character.
NOTES:
[1]: I used Shannon's Entropy formula to compute the number of bits per symbol.
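For the curious, the computation behind note [1] can be reproduced with a few lines of Perl (a quick sketch using the counts above):
use strict;
use warnings;
# Shannon entropy of the sample: A x 93, B x 44, C x 52 (189 chars total)
my %count = (A => 93, B => 44, C => 52);
my $total = 0;
$total += $_ for values %count;
my $H = 0;
for my $n (values %count) {
    my $p = $n / $total;
    $H -= $p * log($p) / log(2);   # log() is base e, so convert to base 2
}
printf "%.3f bits/symbol\n", $H;   # prints 1.505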
EDIT: Fixed the wikipedia link as noted by fellow monks AnomalousMonk and LanX.
...roboticus
When your only tool is a hammer, all problems look like your thumb.
> When I say random I always assume sampling from a uniform dist.
The shown sample is obviously far from random or uniformly distributed.
> > > Example:
ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
...
Please show us some real input. And more of it.
Otherwise, if that's real input, please realize that it's very redundant: a truly uniform distribution couldn't be compressed to 5%, as happens here.
Re: How to efficiently pack a string of 63 characters
by tybalt89 (Monsignor) on Sep 09, 2021 at 18:47 UTC
Just for fun, here are a few different ways to do compression. Note that I assume you have a longer file with many lines that are 63 chars each. I remove the newlines (they can always be re-inserted later) because some compression algorithms do better with longer strings of data, and the newline would introduce a fourth character to the mix.
Have fun!
#!/usr/bin/perl
use strict; # https://perlmonks.org/?node_id=11136582
use warnings;
for my $try (
[ 'gzip/gunzip', \&compgzip, \&uncompgzip ],
[ '2 bit code, 6 bit runlength', \&comp62, \&uncomp62 ],
[ '2 bits per letter', \&comp2bits, \&uncomp2bits ],
[ 'groups of 5,2,1', \&comp5, \&uncomp5 ],
)
{
my ($method, $comp, $uncomp) = @$try;
print "\n------------------------------ Compression by $method\n\n";
my $data = <<END;
ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
END
$data =~ tr/\n//d; # remove \n they can be re-inserted later
print "length of data @{[ $data =~ tr/ABC// ]}\n";
my $compressed = $comp->($data);
print "length of compressed data @{[ length $compressed ]}\n";
#use Data::Dump 'dd'; dd $compressed;
# print unpack('H*', $compressed), "\n";
my $uncompressed = $uncomp->($compressed);
printf "compressed to %.1f%%\n",
100 * length($compressed) / length $uncompressed;
print $data eq $uncompressed ? "MATCH" : "************ no MATCH", "\n";
}
# compress by groups of 5,2,1 to single letter
sub comp5
{
my @code = map glob('{A,B,C}' x $_), 5, 2, 1;
my %code;
@code{@code} = map chr, 1 .. @code;
local $" = '|';
shift =~ s/(@code)/$code{$1}/gr
}
sub uncomp5
{
my @code = map glob('{A,B,C}' x $_), 5, 2, 1;
my %code;
@code{map chr, 1 .. @code} = @code;
join '', @code{split //, shift};
}
# compress by lower two bits of letter
sub comp2bits
{
my ($ans, $n) = ('', 0);
vec($ans, $n++, 2) = 3 & ord $_ for split //, shift;
$ans;
}
sub uncomp2bits
{
my $comp = shift;
join '', map
{
('', 'A', 'B', 'C')[ vec $comp, $_, 2]
} 0 .. -1 + 4 * length $comp;
}
# compress by run length: 6 bit length and 2 bit letter code
sub comp62
{
shift =~ s/([ABC])\1{0,62}/ chr( length($&) << 2 | ord($1) & 3) /ger;
}
sub uncomp62
{
shift =~ s/./ (($& & "\3") | '@') x (ord($&) >> 2) /gesr;
}
# compress by gzip
use IO::Compress::Gzip qw(gzip);
sub compgzip
{
gzip \(shift) => \(my $output);
$output;
}
use IO::Uncompress::Gunzip qw(gunzip);
sub uncompgzip
{
gunzip \(shift) => \(my $output);
$output;
}
Outputs:
------------------------------ Compression by gzip/gunzip
length of data 189
length of compressed data 77
compressed to 40.7%
MATCH
------------------------------ Compression by 2 bit code, 6 bit runlength
length of data 189
length of compressed data 76
compressed to 40.2%
MATCH
------------------------------ Compression by 2 bits per letter
length of data 189
length of compressed data 48
compressed to 25.4%
MATCH
------------------------------ Compression by groups of 5,2,1
length of data 189
length of compressed data 39
compressed to 20.6%
MATCH
I tried gzip on the groups 5,2,1 compression string and gzip made it longer :)
> gzip/gunzip ... compressed to 40.7%
I'm pretty sure that zip has some fixed overhead which doesn't pay off with just 189 bytes of input.
> my @code = map glob('{A,B,C}' x $_), 5, 2, 1;
I haven't run your code, but it looks like you are mapping 9 possible chunks to a character needing a byte.
Looks like you are wasting space. Already a naive 4-bits-per-chunk approach, i.e. 1 byte for two chunks, would double your efficiency.° (Needless to say, Huffman even more.)
(Update: sorry, I totally misunderstood how your glob works.)
FWIW, I doubt the input sample is realistic. It looks handmade by copying and altering the first line, hence the high redundancy. :)
I think you might be misreading the glob.
The 5,2,1 maps 5 letters to one byte (look at the mapping hash with Data::Dump), so I'm getting a 5-to-1 reduction.
(Ignoring the 2,1 codes, which are just there for strings whose length is not a multiple of 5.)
I don't understand your use of 'chunks', or where you get 9 of them.
BTW: It's a 5-to-1 reduction independent of the redundancy in the string. Other compressors may use redundancy to do better. It's sort of a question of how random the letters really are.
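For reference, here is what that glob actually builds (a quick demo, not part of the original post):
my @code = glob '{A,B,C}' x 5;
print scalar @code, "\n";   # 243, i.e. 3**5 five-letter strings
print "@code[0..3]\n";      # AAAAA AAAAB AAAAC AAABA
With 243 + 9 + 3 = 255 codes in total, every group gets a single non-NUL byte, hence the 5-to-1 reduction.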
> I'm pretty sure that zip has some fixed overhead which doesn't pay off with just 189 bytes of input.
To prove my point, here is an altered version of your code which fakes longer data by rotating the original input, showing gzip at 5% compression. That's a factor of 4 better than your champion.
(Of course rotating is kind of biased, because it keeps most run-length chunks intact, and zip will efficiently Huffman-code all the runs it finds.°
But it's up to the OP to provide unbiased data, I'm no psychic... ;-)
FWIW: gzip is already at 18% after only tripling the input.
use strict; # https://perlmonks.org/?node_id=11136582
use warnings; # originally https://perlmonks.org/?displaytype=displaycode;node_id=11136614;abspart=1;part=1
use feature 'say';
for my $try (
[ 'gzip/gunzip', \&compgzip, \&uncompgzip ],
[ '2 bit code, 6 bit runlength', \&comp62, \&uncomp62 ],
[ '2 bits per letter', \&comp2bits, \&uncomp2bits ],
[ 'groups of 5,2,1', \&comp5, \&uncomp5 ],
) {
my ($method, $comp, $uncomp) = @$try;
print "\n------------------------------ Compression by $method\n\n
+";
my $data = <<END;
ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
END
my @data = split/\n/, $data;
push @data, map { rotate($data[$_]) } 0 .. 2 for 1 .. 100; # fake x100 times data
$data = join "", @data; # no newlines; they can be re-inserted later
print "length of data @{[ $data =~ tr/ABC// ]}\n";
my $compressed = $comp->($data);
print "length of compressed data @{[ length $compressed ]}\n";
#use Data::Dump 'dd'; dd $compressed;
# print unpack('H*', $compressed), "\n";
my $uncompressed = $uncomp->($compressed);
printf "compressed to %.1f%%\n",
100 * length($compressed) / length $uncompressed;
print $data eq $uncompressed ? "MATCH" : "************ no MATCH",
+"\n";
}
# by lanx
sub rotate { # fake data from original sample
# say "O: ",
my $orig = shift;
# say
my $rnd = int rand 63;
my $head = substr $orig, 0, $rnd, "";
# say "N: ",
my $new = $orig . $head;
return $new;
}
# compress by groups of 5,2,1 to single letter
sub comp5
{
my @code = map glob('{A,B,C}' x $_), 5, 2, 1;
my %code;
@code{@code} = map chr, 1 .. @code;
local $" = '|';
shift =~ s/(@code)/$code{$1}/gr
}
sub uncomp5
{
my @code = map glob('{A,B,C}' x $_), 5, 2, 1;
my %code;
@code{map chr, 1 .. @code} = @code;
join '', @code{split //, shift};
}
# compress by lower two bits of letter
sub comp2bits
{
my ($ans, $n) = ('', 0);
vec($ans, $n++, 2) = 3 & ord $_ for split //, shift;
$ans;
}
sub uncomp2bits
{
my $comp = shift;
join '', map
{
('', 'A', 'B', 'C')[ vec $comp, $_, 2]
} 0 .. -1 + 4 * length $comp;
}
# compress by runlength or 6 bits length and 2 bits letter code
sub comp62
{
shift =~ s/([ABC])\1{0,62}/ chr( length($&) << 2 | ord($1) & 3) /ger;
}
sub uncomp62
{
shift =~ s/./ (($& & "\3") | '@') x (ord($&) >> 2) /gesr;
}
# compress by gzip
use IO::Compress::Gzip qw(gzip);
sub compgzip
{
gzip \(shift) => \(my $output);
$output;
}
use IO::Uncompress::Gunzip qw(gunzip);
sub uncompgzip
{
gunzip \(shift) => \(my $output);
$output;
}
Compilation started at Fri Sep 10 13:25:18
C:/Strawberry/perl/bin\perl.exe -w d:/tmp/pm/pack_63_chars.pl
------------------------------ Compression by gzip/gunzip
length of data 19089
length of compressed data 1021
compressed to 5.3%
MATCH
------------------------------ Compression by 2 bit code, 6 bit runlength
length of data 19089
length of compressed data 7714
compressed to 40.4%
MATCH
------------------------------ Compression by 2 bits per letter
length of data 19089
length of compressed data 4773
compressed to 25.0%
MATCH
------------------------------ Compression by groups of 5,2,1
length of data 19089
length of compressed data 3819
compressed to 20.0%
MATCH
Compilation finished at Fri Sep 10 13:25:18
°) Even after constantly reversing one part of the input, I'm at 9% compression for 100x input.
Re: How to efficiently pack a string of 63 characters
by jwkrahn (Abbot) on Sep 09, 2021 at 01:51 UTC
$ echo "ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAAB
+C
BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC" | per
+l -le'
use Data::Dumper; $Data::Dumper::Useqq = 1;
my %alphabet = qw/ A 00 B 01 C 11 /; my %rev_alpha = reverse %alphabet;
while ( <> ) {
chomp;
print length(), " ", $_;
# compress
my $data = pack "B*", join "", map $alphabet{ $_ }, split //;
print length $data, " ", Dumper $data;
# decompress
my $uncp = join "", map $rev_alpha{ $_ }, map /../g, unpack "B*", $data;
print length $uncp, " ", $uncp, "\n";
}
'
63 ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
16 $VAR1 = "\27p\1w?\0000\1U\0\24\0\1\177<\34";
64 ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABCA
63 BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
16 $VAR1 = "\177p\301\\\17\0000\1U\0\24\0\1\177<\\";
64 BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBCA
63 ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
16 $VAR1 = "\37T\1G1\377\360\1u\377\0\0\0?<\374";
64 ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCCA
One of the problems with this is that "00" is a valid value, so on decompression you get extra "A"s padding the right-hand side of the string.
If you change the alphabet to:
01 A
10 B
11 C
It will decompress properly:
$ echo "ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAAB
+C
BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC" | per
+l -le'
use Data::Dumper; $Data::Dumper::Useqq = 1;
my %alphabet = qw/ A 01 B 10 C 11 /; my %rev_alpha = reverse %alphabet;
while ( <> ) {
chomp;
print length(), " ", $_;
# compress
my $data = pack "B*", join "", map $alphabet{ $_ }, split //;
print length $data, " ", Dumper $data;
# decompress
my $uncp = join "", map $rev_alpha{ $_ }, map /../g, unpack "B*", $data;
print length $uncp, " ", $uncp, "\n";
}
'
63 ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
16 $VAR1 = "k\265V\273\177UuV\252UiUV\277}l";
63 ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
63 BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
16 $VAR1 = "\277\265\326\255_UuV\252UiUV\277}\254";
63 BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
63 ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
16 $VAR1 = "o\251V\233v\377\365V\272\377UUU\177}\374";
63 ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
Re: How to efficiently pack a string of 63 characters
by bliako (Abbot) on Sep 09, 2021 at 06:51 UTC
One can assume a random distribution of characters from Alphabet A = {A, B, C}
Uniformly random? If they follow another distribution (e.g. if A is more likely than B) you can get an advantage.
Investigating the distribution of A,B,C can be insightful in deciding whether 6 bits for RL coding is too much. I can see that you get long sequences like AAAAAAAAAAAAA which can utilise a lot of RL bits, but what's the percentage of unused RL bits? For which characters?
3^63 = 1144561273430837494885949696427. Are all these combinations equally likely? Or even possible? If not, you can get an advantage.
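For what it's worth, since 3**63 < 2**100, each 63-character line fits in exactly 100 bits, i.e. 13 bytes: three bytes fewer than 2-bits-per-letter packing, and essentially at the entropy limit for uniform data. A sketch with the core Math::BigInt module (slow, but it shows the idea; the sub names are mine):
use strict;
use warnings;
use Math::BigInt;
# treat the 63-char string as a base-3 number and store it in 13 bytes
sub pack63 {
    my $n = Math::BigInt->new(0);
    $n = $n * 3 + (ord($_) - ord('A')) for split //, shift;
    my $hex = substr $n->as_hex, 2;                  # strip leading "0x"
    pack 'H*', ('0' x (26 - length $hex)) . $hex;    # left-pad to 13 bytes
}
sub unpack63 {
    my $n = Math::BigInt->from_hex(unpack 'H*', shift);
    my $s = '';
    for (1 .. 63) {
        my ($q, $r) = $n->copy->bdiv(3);             # quotient and remainder
        $s = ('A', 'B', 'C')[$r] . $s;
        $n = $q;
    }
    $s;
}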
Your data is similar to gene sequences, for which compression is a well-studied problem. See Compression_of_genomic_sequencing_data for ideas.
bw, bliako
Re: How to efficiently pack a string of 63 characters
by LanX (Saint) on Sep 08, 2021 at 20:48 UTC
If the sample is real and representative, then it looks like you have lots of similarities between two lines, akin to image data.
I'd try:
- calculate the deltas between two lines, like C-A=2
- do a run-length encoding of those deltas, like (2,5),(0,20),...
- probably try Huffman-coding the run-length pairs
You can skip step 3 and try zipping directly after step 2, because Huffman coding is part of zip IIRC.
NB: feeding binary data into zip might be counterproductive. Better a "readable" text where the patterns are visible.
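A minimal sketch of steps 1 and 2, assuming equal-length lines (the helper names are made up, not anything the OP specified):
use strict;
use warnings;
my %val = (A => 0, B => 1, C => 2);
my @sym = ('A', 'B', 'C');
# step 1: delta each line against the previous one, mod 3
sub delta_lines {
    my @lines = @_;
    my $prev = 'A' x length $lines[0];   # all-zero reference for the first line
    my @out;
    for my $line (@lines) {
        my @p = map $val{$_}, split //, $prev;
        my @c = map $val{$_}, split //, $line;
        push @out, join '', map $sym[ ($c[$_] - $p[$_]) % 3 ], 0 .. $#c;
        $prev = $line;
    }
    @out;   # similar lines turn into mostly-'A' strings
}
# step 2: run-length pairs like (2,5), (0,20), ...
sub rle_pairs {
    my $s = shift;
    my @pairs;
    push @pairs, [ $val{$1}, length $& ] while $s =~ /(.)\1*/g;
    @pairs;
}
Feeding either the delta text or the pairs into gzip then shows whether the line-to-line similarity buys anything.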
> calculate the deltas between two lines like C-A=2
OK, I tried this and fed the result into IO::Compress::Gzip; in all scenarios this yielded at most one percent of extra gain.
I'd say gzip is the best choice here: it compressed to between 4% and 24% (worst case) without any prior knowledge, and 19.81% (= log2(3)/8) is the theoretical optimum for the worst case of full entropy.
I also tried providing extra flags for best compression and a minimal header, but the benefit was again only 1% at most.
And this module is in core.
>corelist IO::Compress::Gzip
Data for 2021-01-23
IO::Compress::Gzip was first released with perl v5.9.4
I think the decision is a no-brainer ...
corelist IO::Compress::Bzip2
Data for 2021-05-20
IO::Compress::Bzip2 was first released with perl v5.10.1
Re: How to efficiently pack a string of 63 characters
by Discipulus (Canon) on Sep 09, 2021 at 12:12 UTC
Hello baxy77bax,
sorry for my naive suggestion, in fact it is not at all my field (well.. what is my field? ;) but what about something like:
use strict;
use warnings;
my @str = qw(
ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
);
foreach my $str ( @str ){
print "$str\n",
$str =~ s/([A-C])(\1{2,})/$1.(1+length($2))/reg,
"\n\n";
}
__END__
ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
ABBCBCA5BBCBCAC3A5CA5B5A5BBA8BBC3ACCAABC
BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
BC3BCAACAAB3CA3CCA5CA5B5A5BBA8BBC3ACCABBC
ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
ABCCB3A4BBABCACABC6A5BBCBBC4A13C3ACCAC3
L*
There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Yes, but run-length encoding (+ Huffman) is already part of zip.
See
http://www.zlib.net/feldspar.html
The only chance to improve on zip is to compress redundancy which isn't found by zip, while providing an input which is character-based (which you are doing).
That's why I suggested a line-delta compression first in my other answer, provided the example was real (which is not a given with this OP).
See Re: How to efficiently pack a string of 63 characters
(There is a chance tho that zip is clever enough to notice such repeating groups too. But that kind of stuff is normally part of image, not text, compression.)
Re: How to efficiently pack a string of 63 characters
by Anonymous Monk on Sep 08, 2021 at 20:58 UTC
Ironically, you might reduce the efficiency of the compression, because you have squeezed out the "redundancy" upon which text-compression algorithms depend. Simply stated, these algorithms recognize "a sequence that we have recently seen before" and replace it with a coded reference to that sequence. Your pre-processing might remove that redundancy, reducing or even eliminating the chance to substitute codes. However, there's really only one way to find out: an empirical test. Generate a few thousand bits of plausibly representative data (unless you have the real stuff at hand), and try it both ways.
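One way to run that experiment, sketched in Perl (assuming uniformly random input; the variable names are made up):
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip);
# plausible data: 1000 random 63-char lines over {A,B,C}, newlines stripped
my $raw = join '', map +('A', 'B', 'C')[int rand 3], 1 .. 63_000;
# pre-pack at 2 bits per letter, as suggested elsewhere in the thread
my ($packed, $i) = ('', 0);
vec($packed, $i++, 2) = 3 & ord $_ for split //, $raw;
gzip \$raw    => \my $gz_raw;
gzip \$packed => \my $gz_packed;
printf "raw:    %6d -> %6d bytes\n", length $raw,    length $gz_raw;
printf "packed: %6d -> %6d bytes\n", length $packed, length $gz_packed;
On truly uniform data the raw text should gzip to somewhere near the 20% entropy bound, while the pre-packed version leaves gzip little redundancy to find.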
Re: How to efficiently pack a string of 63 characters
by Marshall (Canon) on Sep 10, 2021 at 13:23 UTC
From your problem statement, you have not just ASCII characters but only upper-case ASCII, not binary data.
I would go "old school" and use the Baudot_code. This gets each 8 bit byte to 5 bits. Shifting between upper and lower case can be done but requires an additional "control" character.
The advantage is that this is a well-known standard still in use today. There is nothing tricky or fancy about it, which means it can be done very efficiently: 8 bits -> 5 bits according to a standard protocol.
Update: That means that 3 of every 8 bits are not needed; 3/8 = 37.5% compression for almost nothing.
Also, if this is true: "One can assume a random distribution of characters from Alphabet A = {A, B, C}", there is not much that you can do past Baudot code, since random data cannot be compressed. Oh, perhaps I see now: the alphabet will be limited to a set of fewer than 26 characters. In that case, perhaps make your own version of the Baudot code? Perhaps send a block of translation tables before the actual data? It does appear to me that you can achieve a huge amount of compression with a table-driven, very efficient algorithm.
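A minimal sketch of such a table-driven 5-bit repack (a custom code rather than true Baudot, so no shift characters; the sub names are mine):
use strict;
use warnings;
my @alphabet = ('A' .. 'Z');    # up to 32 symbols fit in 5 bits
my %code;
@code{@alphabet} = 0 .. $#alphabet;
sub pack5bit {
    pack 'B*', join '', map sprintf('%05b', $code{$_}), split //, shift;
}
sub unpack5bit {
    my ($packed, $len) = @_;    # the original length must be transmitted too
    my $bits = unpack 'B*', $packed;
    join '', map $alphabet[ oct('0b' . substr($bits, $_ * 5, 5)) ], 0 .. $len - 1;
}
my $s = 'ABBCBCAAAAABBCBCACCCAAAAACAAAAA';
my $p = pack5bit($s);
printf "%d chars -> %d bytes\n", length $s, length $p;    # 31 -> 20
print unpack5bit($p, length $s) eq $s ? "MATCH\n" : "no MATCH\n";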