Re: How to efficiently pack a string of 63 characters
by GrandFather (Saint) on Sep 08, 2021 at 21:00 UTC
Essentially, you can't compress compressed data. There are tricks that can help with some data, but probably not with the data you present. Something like what you suggest can work well for digitally sampled analog data, where most of the information is in the low bits, so compressing the high bits of the samples separately can be advantageous. Doing that where all the high bits are the same for the entire data set only bulks out the final result.
Optimising for fewest keystrokes only makes sense when transmitting to Pluto or beyond
Re: How to efficiently pack a string of 63 characters
by roboticus (Chancellor) on Sep 09, 2021 at 14:11 UTC
baxy77bax:
The amount of compression you can obtain is directly related to the amount of information in the data. The more constraints the data has, the less information it contains, and so the more you can compress it.
You've mentioned that you have an alphabet of three characters (A, B, C) and a random distribution. If you mean a uniform random distribution (i.e., all letters having a probability of 1/3), then each character in your string takes 1.585 bits to encode[1]. But your sample data shows that of 189 characters, A occurs 93 times, B occurs 44 times and C occurs 52 times. If that represents the actual distribution of characters, then you need only about 1.505 bits per character, allowing you to squeeze a bit more out of your 10K strings.
If you're able to learn more constraints on the data (i.e., rules that valid data must follow), then you can squeeze the data even more. So if you want people to help you compress the data further, we need to know more about it. Note that *any* rules about the data can be relevant, so don't hold back! As an example, if each line were "similar" to the previous line by some simple rule, that may give us the information needed to exploit the similarity.
As an example, a couple of years ago you asked a similar question, and by looking over the code used to generate the data, we were able to determine that we could encode your 90-character strings in 51 bits, giving about 0.567 bits per character.
NOTES:
[1]: I used Shannon's Entropy formula to compute the number of bits per symbol.
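For the curious, the computation behind note [1] can be reproduced with a few lines of Perl (a quick sketch using the counts above):
use strict;
use warnings;
# Shannon entropy of the sample: A x 93, B x 44, C x 52 (189 chars total)
my %count = (A => 93, B => 44, C => 52);
my $total = 0;
$total += $_ for values %count;
my $H = 0;
for my $n (values %count) {
    my $p = $n / $total;
    $H -= $p * log($p) / log(2);   # log() is base e, so convert to base 2
}
printf "%.3f bits/symbol\n", $H;   # prints 1.505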
EDIT: Fixed the wikipedia link as noted by fellow monks AnomalousMonk and LanX.
...roboticus
When your only tool is a hammer, all problems look like your thumb.
> When I say random I always assume sampling from a uniform dist.
The shown sample is obviously far from random or uniformly distributed.
> > > Example:
ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
...
Please show us some real input. And more of it.
Otherwise, if that's real input, please realize that it's very redundant: a truly uniform distribution couldn't be compressed to 5%, as happens here.
Re: How to efficiently pack a string of 63 characters
by tybalt89 (Monsignor) on Sep 09, 2021 at 18:47 UTC
Just for fun, here are a few different ways to do compression. Note that I assume you have a longer file with many lines that are 63 chars each. I remove the newlines (they can always be re-inserted later) because some compression algorithms do better with longer strings of data, and the newline would introduce a fourth character to the mix.
Have fun!
#!/usr/bin/perl
use strict; # https://perlmonks.org/?node_id=11136582
use warnings;
for my $try (
[ 'gzip/gunzip', \&compgzip, \&uncompgzip ],
[ '2 bit code, 6 bit runlength', \&comp62, \&uncomp62 ],
[ '2 bits per letter', \&comp2bits, \&uncomp2bits ],
[ 'groups of 5,2,1', \&comp5, \&uncomp5 ],
)
{
my ($method, $comp, $uncomp) = @$try;
print "\n------------------------------ Compression by $method\n\n";
my $data = <<END;
ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
END
$data =~ tr/\n//d; # remove \n they can be re-inserted later
print "length of data @{[ $data =~ tr/ABC// ]}\n";
my $compressed = $comp->($data);
print "length of compressed data @{[ length $compressed ]}\n";
#use Data::Dump 'dd'; dd $compressed;
# print unpack('H*', $compressed), "\n";
my $uncompressed = $uncomp->($compressed);
printf "compressed to %.1f%%\n",
100 * length($compressed) / length $uncompressed;
print $data eq $uncompressed ? "MATCH" : "************ no MATCH", "\n";
}
# compress by groups of 5,2,1 to single letter
sub comp5
{
my @code = map glob('{A,B,C}' x $_), 5, 2, 1;
my %code;
@code{@code} = map chr, 1 .. @code;
local $" = '|';
shift =~ s/(@code)/$code{$1}/gr
}
sub uncomp5
{
my @code = map glob('{A,B,C}' x $_), 5, 2, 1;
my %code;
@code{map chr, 1 .. @code} = @code;
join '', @code{split //, shift};
}
# compress by lower two bits of letter
sub comp2bits
{
my ($ans, $n) = ('', 0);
vec($ans, $n++, 2) = 3 & ord $_ for split //, shift;
$ans;
}
sub uncomp2bits
{
my $comp = shift;
join '', map
{
('', 'A', 'B', 'C')[ vec $comp, $_, 2]
} 0 .. -1 + 4 * length $comp;
}
# compress by run length: 6 bit length and 2 bit letter code
sub comp62
{
shift =~ s/([ABC])\1{0,62}/ chr( length($&) << 2 | ord($1) & 3) /ger;
}
sub uncomp62
{
shift =~ s/./ (($& & "\3") | '@') x (ord($&) >> 2) /gesr;
}
# compress by gzip
use IO::Compress::Gzip qw(gzip);
sub compgzip
{
gzip \(shift) => \(my $output);
$output;
}
use IO::Uncompress::Gunzip qw(gunzip);
sub uncompgzip
{
gunzip \(shift) => \(my $output);
$output;
}
Outputs:
------------------------------ Compression by gzip/gunzip
length of data 189
length of compressed data 77
compressed to 40.7%
MATCH
------------------------------ Compression by 2 bit code, 6 bit runlength
length of data 189
length of compressed data 76
compressed to 40.2%
MATCH
------------------------------ Compression by 2 bits per letter
length of data 189
length of compressed data 48
compressed to 25.4%
MATCH
------------------------------ Compression by groups of 5,2,1
length of data 189
length of compressed data 39
compressed to 20.6%
MATCH
I tried gzip on the groups 5,2,1 compression string and gzip made it longer :)
> gzip/gunzip ... compressed to 40.7%
I'm pretty sure that zip has some fixed overhead which doesn't pay off with just 189 bytes of input.
> my @code = map glob('{A,B,C}' x $_), 5, 2, 1;
I haven't run your code, but it looks like you are mapping 9 possible chunks to a character needing a byte.
Looks like you are wasting space. Already a naive 4-bits-per-chunk approach, i.e. 1 byte for two chunks, would double your efficiency.° (Needless to say, Huffman even more.)
(Update: sorry, I totally misunderstood how your glob works.)
FWIW, I doubt the input sample is realistic. It looks handmade by copying and altering the first line, hence the high redundancy. :)
I think you might be misreading the glob.
The 5,2,1 maps 5 letters to one byte (look at the mapping hash with Data::Dump), so I'm getting a 5-to-1 reduction.
(Ignoring the 2,1 codes, which are just there for strings whose length is not a multiple of 5.)
I don't understand your use of 'chunks', or where you get 9 of them.
BTW: It's a 5-to-1 reduction independent of the redundancy in the string. Other compressors may use redundancy to do better. It's sort of a question of how random the letters really are.
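For reference, here is what that glob actually builds (a quick demo, not part of the original post):
my @code = glob '{A,B,C}' x 5;
print scalar @code, "\n";   # 243, i.e. 3**5 five-letter strings
print "@code[0..3]\n";      # AAAAA AAAAB AAAAC AAABA
With 243 + 9 + 3 = 255 codes in total, every group gets a single non-NUL byte, hence the 5-to-1 reduction.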
> I'm pretty sure that zip has some fixed overhead which doesn't pay off with just 189 bytes of input.
To prove my point, here is an altered version of your code which fakes longer data by rotating the original input, showing gzip at 5% compression. That's a factor of 4 better than your champion.
(Of course rotating is kind of biased, because it keeps most run-length chunks intact, and zip will efficiently Huffman-code all the runs it finds.°
But it's up to the OP to provide unbiased data, I'm no psychic... ;-)
FWIW: gzip is already at 18% after only tripling the input.
use strict; # https://perlmonks.org/?node_id=11136582
use warnings; # originally https://perlmonks.org/?displaytype=displaycode;node_id=11136614;abspart=1;part=1
use feature 'say';
for my $try (
[ 'gzip/gunzip', \&compgzip, \&uncompgzip ],
[ '2 bit code, 6 bit runlength', \&comp62, \&uncomp62 ],
[ '2 bits per letter', \&comp2bits, \&uncomp2bits ],
[ 'groups of 5,2,1', \&comp5, \&uncomp5 ],
) {
my ($method, $comp, $uncomp) = @$try;
print "\n------------------------------ Compression by $method\n\n
+";
my $data = <<END;
ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
END
my @data = split/\n/, $data;
push @data, map { rotate($data[$_]) } 0 .. 2 for 1 .. 100; # fake x100 times data
$data = join "", @data; # no newlines; they can be re-inserted later
print "length of data @{[ $data =~ tr/ABC// ]}\n";
my $compressed = $comp->($data);
print "length of compressed data @{[ length $compressed ]}\n";
#use Data::Dump 'dd'; dd $compressed;
# print unpack('H*', $compressed), "\n";
my $uncompressed = $uncomp->($compressed);
printf "compressed to %.1f%%\n",
100 * length($compressed) / length $uncompressed;
print $data eq $uncompressed ? "MATCH" : "************ no MATCH",
+"\n";
}
# by lanx
sub rotate { # fake data from original sample
# say "O: ",
my $orig = shift;
# say
my $rnd = int rand 63;
my $head = substr $orig, 0, $rnd, "";
# say "N: ",
my $new = $orig . $head;
return $new;
}
# compress by groups of 5,2,1 to single letter
sub comp5
{
my @code = map glob('{A,B,C}' x $_), 5, 2, 1;
my %code;
@code{@code} = map chr, 1 .. @code;
local $" = '|';
shift =~ s/(@code)/$code{$1}/gr
}
sub uncomp5
{
my @code = map glob('{A,B,C}' x $_), 5, 2, 1;
my %code;
@code{map chr, 1 .. @code} = @code;
join '', @code{split //, shift};
}
# compress by lower two bits of letter
sub comp2bits
{
my ($ans, $n) = ('', 0);
vec($ans, $n++, 2) = 3 & ord $_ for split //, shift;
$ans;
}
sub uncomp2bits
{
my $comp = shift;
join '', map
{
('', 'A', 'B', 'C')[ vec $comp, $_, 2]
} 0 .. -1 + 4 * length $comp;
}
# compress by runlength or 6 bits length and 2 bits letter code
sub comp62
{
shift =~ s/([ABC])\1{0,62}/ chr( length($&) << 2 | ord($1) & 3) /ger;
}
sub uncomp62
{
shift =~ s/./ (($& & "\3") | '@') x (ord($&) >> 2) /gesr;
}
# compress by gzip
use IO::Compress::Gzip qw(gzip);
sub compgzip
{
gzip \(shift) => \(my $output);
$output;
}
use IO::Uncompress::Gunzip qw(gunzip);
sub uncompgzip
{
gunzip \(shift) => \(my $output);
$output;
}
Compilation started at Fri Sep 10 13:25:18
C:/Strawberry/perl/bin\perl.exe -w d:/tmp/pm/pack_63_chars.pl
------------------------------ Compression by gzip/gunzip
length of data 19089
length of compressed data 1021
compressed to 5.3%
MATCH
------------------------------ Compression by 2 bit code, 6 bit runlength
length of data 19089
length of compressed data 7714
compressed to 40.4%
MATCH
------------------------------ Compression by 2 bits per letter
length of data 19089
length of compressed data 4773
compressed to 25.0%
MATCH
------------------------------ Compression by groups of 5,2,1
length of data 19089
length of compressed data 3819
compressed to 20.0%
MATCH
Compilation finished at Fri Sep 10 13:25:18
°) Even after constantly reversing one part of the input, I'm at 9% compression for 100x input.
Re: How to efficiently pack a string of 63 characters
by jwkrahn (Abbot) on Sep 09, 2021 at 01:51 UTC
$ echo "ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAAB
+C
BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC" | per
+l -le'
use Data::Dumper; $Data::Dumper::Useqq = 1;
my %alphabet = qw/ A 00 B 01 C 11 /; my %rev_alpha = reverse %alphabet;
while ( <> ) {
chomp;
print length(), " ", $_;
# compress
my $data = pack "B*", join "", map $alphabet{ $_ }, split //;
print length $data, " ", Dumper $data;
# decompress
my $uncp = join "", map $rev_alpha{ $_ }, map /../g, unpack "B*", $data;
print length $uncp, " ", $uncp, "\n";
}
'
63 ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
16 $VAR1 = "\27p\1w?\0000\1U\0\24\0\1\177<\34";
64 ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABCA
63 BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
16 $VAR1 = "\177p\301\\\17\0000\1U\0\24\0\1\177<\\";
64 BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBCA
63 ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
16 $VAR1 = "\37T\1G1\377\360\1u\377\0\0\0?<\374";
64 ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCCA
One of the problems with this is that "00" is a valid value, so on decompression you get extra "A"s padding the right-hand side of the string.
If you change the alphabet to:
01 A
10 B
11 C
It will decompress properly:
$ echo "ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAAB
+C
BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC" | per
+l -le'
use Data::Dumper; $Data::Dumper::Useqq = 1;
my %alphabet = qw/ A 01 B 10 C 11 /; my %rev_alpha = reverse %alphabet;
while ( <> ) {
chomp;
print length(), " ", $_;
# compress
my $data = pack "B*", join "", map $alphabet{ $_ }, split //;
print length $data, " ", Dumper $data;
# decompress
my $uncp = join "", map $rev_alpha{ $_ }, map /../g, unpack "B*", $data;
print length $uncp, " ", $uncp, "\n";
}
'
63 ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
16 $VAR1 = "k\265V\273\177UuV\252UiUV\277}l";
63 ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
63 BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
16 $VAR1 = "\277\265\326\255_UuV\252UiUV\277}\254";
63 BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
63 ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
16 $VAR1 = "o\251V\233v\377\365V\272\377UUU\177}\374";
63 ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
Re: How to efficiently pack a string of 63 characters
by bliako (Abbot) on Sep 09, 2021 at 06:51 UTC
One can assume a random distribution of characters from Alphabet A = {A, B, C}
Uniformly random? If they follow another distribution (e.g. if A is more likely than B) you can get an advantage.
Investigating the distribution of A,B,C can be insightful in deciding whether 6 bits for RL coding is too much. I can see that you get long sequences like AAAAAAAAAAAAA which can utilise a lot of RL bits, but what's the percentage of unused RL bits? For which characters?
3^63 = 1144561273430837494885949696427. Are all these combinations equally likely? Or even possible? If not, you can get an advantage.
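For what it's worth, since 3**63 < 2**100, each 63-character line fits in exactly 100 bits, i.e. 13 bytes: three bytes fewer than 2-bits-per-letter packing, and essentially at the entropy limit for uniform data. A sketch with the core Math::BigInt module (slow, but it shows the idea; the sub names are mine):
use strict;
use warnings;
use Math::BigInt;
# treat the 63-char string as a base-3 number and store it in 13 bytes
sub pack63 {
    my $n = Math::BigInt->new(0);
    $n = $n * 3 + (ord($_) - ord('A')) for split //, shift;
    my $hex = substr $n->as_hex, 2;                  # strip leading "0x"
    pack 'H*', ('0' x (26 - length $hex)) . $hex;    # left-pad to 13 bytes
}
sub unpack63 {
    my $n = Math::BigInt->from_hex(unpack 'H*', shift);
    my $s = '';
    for (1 .. 63) {
        my ($q, $r) = $n->copy->bdiv(3);             # quotient and remainder
        $s = ('A', 'B', 'C')[$r] . $s;
        $n = $q;
    }
    $s;
}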
Your data is similar to gene sequences, for which compression is a well-studied problem. See Compression_of_genomic_sequencing_data for ideas.
bw, bliako
Re: How to efficiently pack a string of 63 characters
by LanX (Saint) on Sep 08, 2021 at 20:48 UTC
If the sample is real and representative, then it looks like you have lots of similarities between two lines, akin to image data.
I'd try:
- calculate the deltas between two lines, like C-A=2
- do a run-length encoding of those deltas, like (2,5),(0,20),...
- probably try Huffman-coding the run-length pairs
You can skip step 3 and try zipping directly after step 2, because Huffman coding is part of zip IIRC.
NB: feeding binary data into zip might be counterproductive. Better a "readable" text where the patterns are visible.
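A minimal sketch of steps 1 and 2, assuming equal-length lines (the helper names are made up, not anything the OP specified):
use strict;
use warnings;
my %val = (A => 0, B => 1, C => 2);
my @sym = ('A', 'B', 'C');
# step 1: delta each line against the previous one, mod 3
sub delta_lines {
    my @lines = @_;
    my $prev = 'A' x length $lines[0];   # all-zero reference for the first line
    my @out;
    for my $line (@lines) {
        my @p = map $val{$_}, split //, $prev;
        my @c = map $val{$_}, split //, $line;
        push @out, join '', map $sym[ ($c[$_] - $p[$_]) % 3 ], 0 .. $#c;
        $prev = $line;
    }
    @out;   # similar lines turn into mostly-'A' strings
}
# step 2: run-length pairs like (2,5), (0,20), ...
sub rle_pairs {
    my $s = shift;
    my @pairs;
    push @pairs, [ $val{$1}, length $& ] while $s =~ /(.)\1*/g;
    @pairs;
}
Feeding either the delta text or the pairs into gzip then shows whether the line-to-line similarity buys anything.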
> calculate the deltas between two lines like C-A=2
OK, I tried this and fed the result into IO::Compress::Gzip; in all scenarios this yielded at most one percent of extra gain.
I'd say gzip is the best choice here: it compressed to between 4% and 24% (worst case) without any prior knowledge, and 19.81% (= log2(3)/8) is the theoretical optimum for the worst case of full entropy.
I also tried providing extra flags for best compression and a minimal header, but the benefit was again only 1% at most.
And this module is in core.
>corelist IO::Compress::Gzip
Data for 2021-01-23
IO::Compress::Gzip was first released with perl v5.9.4
I think the decision is a no-brainer ...
corelist IO::Compress::Bzip2
Data for 2021-05-20
IO::Compress::Bzip2 was first released with perl v5.10.1
Re: How to efficiently pack a string of 63 characters
by Discipulus (Canon) on Sep 09, 2021 at 12:12 UTC
Hello baxy77bax,
sorry for my naive suggestion, in fact it is not at all my field (well.. what is my field? ;) but what about something like:
use strict;
use warnings;
my @str = qw(
ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
);
foreach my $str ( @str ){
print "$str\n",
$str =~ s/([A-C])(\1{2,})/$1.(1+length($2))/reg,
"\n\n";
}
__END__
ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
ABBCBCA5BBCBCAC3A5CA5B5A5BBA8BBC3ACCAABC
BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
BC3BCAACAAB3CA3CCA5CA5B5A5BBA8BBC3ACCABBC
ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
ABCCB3A4BBABCACABC6A5BBCBBC4A13C3ACCAC3
L*
There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Yes, but run-length encoding (+ Huffman) is already part of zip.
See
http://www.zlib.net/feldspar.html
The only chance to improve on zip is to compress redundancy which isn't found by zip, while providing an input which is character-based (which you are doing).
That's why I suggested a line-delta compression first in my other answer, provided the example was real (which is not a given with this OP).
See Re: How to efficiently pack a string of 63 characters
(There is a chance tho that zip is clever enough to notice such repeating groups too. But that kind of stuff is normally part of image, not text, compression.)
Re: How to efficiently pack a string of 63 characters
by Anonymous Monk on Sep 08, 2021 at 20:58 UTC
Ironically, you might reduce the efficiency of the compression, because you have squeezed out the "redundancy" upon which text-compression algorithms depend. Simply stated, these algorithms recognize "a sequence that we have recently seen before" and replace it with a coded reference to that sequence. Your pre-processing might remove that redundancy, reducing or even eliminating the chance to substitute codes. However, there's really only one way to find out: an empirical test. Generate a few thousand bits of plausibly representative data (unless you have the real stuff at hand), and try it both ways.
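One way to run that experiment, sketched in Perl (assuming uniformly random input; the variable names are made up):
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip);
# plausible data: 1000 random 63-char lines over {A,B,C}, newlines stripped
my $raw = join '', map +('A', 'B', 'C')[int rand 3], 1 .. 63_000;
# pre-pack at 2 bits per letter, as suggested elsewhere in the thread
my ($packed, $i) = ('', 0);
vec($packed, $i++, 2) = 3 & ord $_ for split //, $raw;
gzip \$raw    => \my $gz_raw;
gzip \$packed => \my $gz_packed;
printf "raw:    %6d -> %6d bytes\n", length $raw,    length $gz_raw;
printf "packed: %6d -> %6d bytes\n", length $packed, length $gz_packed;
On truly uniform data the raw text should gzip to somewhere near the 20% entropy bound, while the pre-packed version leaves gzip little redundancy to find.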
Re: How to efficiently pack a string of 63 characters
by Marshall (Canon) on Sep 10, 2021 at 13:23 UTC
From your problem statement, you have not just ASCII characters but only upper-case ASCII, not binary data.
I would go "old school" and use the Baudot_code. This gets each 8 bit byte to 5 bits. Shifting between upper and lower case can be done but requires an additional "control" character.
The advantage is that this is a well-known standard still in use today. There is nothing tricky or fancy about it, which means it can be done very efficiently: 8 bits -> 5 bits according to a standard protocol.
Update: That means that 3 of every 8 bits are not needed; 3/8 = 37.5% compression for almost nothing.
Also, if this is true: "One can assume a random distribution of characters from Alphabet A = {A, B, C}", there is not much that you can do past Baudot code, since random data cannot be compressed. Oh, perhaps I see now: the alphabet will be limited to a set of fewer than 26 characters. In that case, perhaps make your own version of the Baudot code? Perhaps send a block of translation tables before the actual data? It does appear to me that you can achieve a huge amount of compression with a table-driven, very efficient algorithm.
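A minimal sketch of such a table-driven 5-bit repack (a custom code rather than true Baudot, so no shift characters; the sub names are mine):
use strict;
use warnings;
my @alphabet = ('A' .. 'Z');    # up to 32 symbols fit in 5 bits
my %code;
@code{@alphabet} = 0 .. $#alphabet;
sub pack5bit {
    pack 'B*', join '', map sprintf('%05b', $code{$_}), split //, shift;
}
sub unpack5bit {
    my ($packed, $len) = @_;    # the original length must be transmitted too
    my $bits = unpack 'B*', $packed;
    join '', map $alphabet[ oct('0b' . substr($bits, $_ * 5, 5)) ], 0 .. $len - 1;
}
my $s = 'ABBCBCAAAAABBCBCACCCAAAAACAAAAA';
my $p = pack5bit($s);
printf "%d chars -> %d bytes\n", length $s, length $p;    # 31 -> 20
print unpack5bit($p, length $s) eq $s ? "MATCH\n" : "no MATCH\n";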