in reply to Re: How to efficently pack a string of 63 characters
in thread How to efficently pack a string of 63 characters

> gzip/gunzip ...compressed to 40.7%

I'm pretty sure that zip has some fixed overhead which doesn't pay off with just 189 bytes of input.
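For illustration, a minimal sketch of that overhead, using the same IO::Compress::Gzip call as in the benchmark script further down (the exact byte count depends on the zlib build and header options):

use strict;
use warnings;
use IO::Compress::Gzip qw(gzip);

# Even an empty input produces the 10-byte gzip header, the 8-byte
# CRC/length trailer and a tiny deflate block: roughly 20 bytes that a
# 189-byte input has to amortize before any real compression shows up.
gzip \(my $empty = '') => \(my $out);
print length($out), "\n";   # ~20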

> my @code = map glob('{A,B,C}'x $_), 5, 2, 1;

I haven't run your code, but it looks like you are mapping 9 possible chunks to a character, each needing a byte.

Looks like you are wasting space. Even a naive 4-bits-per-chunk approach, i.e. 1 byte for two chunks, would double your efficiency.° (Needless to say, Huffman coding would do even better.) (Update: sorry, I totally misunderstood how your glob works.)

FWIW, I doubt the input sample is realistic. It looks handmade, by copying and altering the first line, hence the high redundancy. :)

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

Re^3: How to efficently pack a string of 63 characters
by tybalt89 (Monsignor) on Sep 09, 2021 at 23:39 UTC

    I think you might be misreading the glob.

    The 5,2,1 maps 5 letters to one byte (look at the mapping hash with Data::Dump), so I'm getting a 5 to 1 reduction. (Ignoring the 2,1 which is just there for strings whose length is not a multiple of 5.)
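    For example, here is a minimal sketch of that inspection (it reuses the glob construction from comp5 and assumes Data::Dump is installed):

use strict;
use warnings;
use Data::Dump 'dd';

# same construction as in comp5: all 3**5 + 3**2 + 3 = 255 letter groups,
# each mapped to a single byte chr(1) .. chr(255)
my @code = map glob('{A,B,C}' x $_), 5, 2, 1;
my %code;
@code{@code} = map chr, 1 .. @code;

dd scalar keys %code;   # 255
dd $code{'AAAAA'};      # the first 5-letter group, mapped to chr(1)
dd $code{'AAAAB'};      # the second, mapped to chr(2)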

    I don't understand your use of 'chunks', or where you get 9 of them.

    BTW: It's a 5 to 1 reduction independent of the redundancy in the string. Other compressors may use redundancy to do better. It's sort of a question of how random the letters really are.

      Yes, I misread the glob as producing 9 run-length chunks 'AAAAA', 'AA', 'A', ... and coding each as a byte.

      I agree that this 5 to 1 is almost optimal if the alphabet is really random, i.e. without redundancy.

      3**5 = 243, which means you are using 7.92 of the byte's 8 bits. Plus a little extra for the smaller trailing chunks.

      That's very efficient. The theoretical optimum is about 37.5 bytes, and you only need 39.
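      A small sketch of that arithmetic (assuming the 189-character sample, i.e. three 63-character lines):

use strict;
use warnings;

my $chars   = 189;                 # 3 lines x 63 letters
my $entropy = log(3) / log(2);     # ~1.585 bits per truly random letter
printf "optimum : %.2f bytes\n", $chars * $entropy / 8;   # ~37.45
# groups of 5,2,1: 37 full groups of 5 (37 bytes), and the remaining
# 4 letters fit into two groups of 2 (2 more bytes)
printf "5,2,1   : %d bytes\n", int($chars / 5) + 2;       # 39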

      But I think zip should do considerably better than 40% if this particular raw input were longer. (Update: as proven here.)

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

Re^3: How to efficently pack a string of 63 characters (longer input)
by LanX (Saint) on Sep 10, 2021 at 11:32 UTC
    > I'm pretty sure that zip has some fix overhead which doesn't pay off with just 189 bytes input.

    To prove my point, here is an altered version of your code which fakes longer data by rotating the original input, showing zip at 5% compression. That's a factor of 4 better than your champion.

    (Of course, rotating is kind of biased, because it keeps most run-length chunks intact and zip will efficiently Huffman-code all the runs it finds.° But it's up to the OP to provide unbiased data; I'm no psychic... ;-)

    FWIW: zip is already at 18% after only tripling the input.

use strict;    # https://perlmonks.org/?node_id=11136582
use warnings;  # originally https://perlmonks.org/?displaytype=displaycode;node_id=11136614;abspart=1;part=1
use feature 'say';

for my $try (
  [ 'gzip/gunzip',                 \&compgzip,  \&uncompgzip  ],
  [ '2 bit code, 6 bit runlength', \&comp62,    \&uncomp62    ],
  [ '2 bits per letter',           \&comp2bits, \&uncomp2bits ],
  [ 'groups of 5,2,1',             \&comp5,     \&uncomp5     ],
  )
{
  my ($method, $comp, $uncomp) = @$try;
  print "\n------------------------------ Compression by $method\n\n";

  my $data = <<END;
ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
END

  my @data = split /\n/, $data;
  push @data, map { rotate($data[$_]) } 0..2 for 1..100;   # fake x100 times data
  $data = join "", @data;   # remove \n, they can be re-inserted later

  print "length of data @{[ $data =~ tr/ABC// ]}\n";
  my $compressed = $comp->($data);
  print "length of compressed data @{[ length $compressed ]}\n";
  #use Data::Dump 'dd'; dd $compressed;
  # print unpack('H*', $compressed), "\n";
  my $uncompressed = $uncomp->($compressed);
  printf "compressed to %.1f%%\n",
    100 * length($compressed) / length $uncompressed;
  print $data eq $uncompressed ? "MATCH" : "************ no MATCH", "\n";
}

# by lanx

sub rotate {   # fake data from original sample
  # say "O: ",
  my $orig = shift;
  # say
  my $rnd  = int rand 63;
  my $head = substr $orig, 0, $rnd, "";
  # say "N: ",
  my $new  = $orig . $head;
  return $new;
}

# compress by groups of 5,2,1 to single letter

sub comp5 {
  my @code = map glob('{A,B,C}' x $_), 5, 2, 1;
  my %code;
  @code{@code} = map chr, 1 .. @code;
  local $" = '|';
  shift =~ s/(@code)/$code{$1}/gr
}

sub uncomp5 {
  my @code = map glob('{A,B,C}' x $_), 5, 2, 1;
  my %code;
  @code{map chr, 1 .. @code} = @code;
  join '', @code{split //, shift};
}

# compress by lower two bits of letter

sub comp2bits {
  my ($ans, $n) = ('', 0);
  vec($ans, $n++, 2) = 3 & ord $_ for split //, shift;
  $ans;
}

sub uncomp2bits {
  my $comp = shift;
  join '', map { ('', 'A', 'B', 'C')[ vec $comp, $_, 2 ] } 0 .. -1 + 4 * length $comp;
}

# compress by runlength: 6 bits length and 2 bits letter code

sub comp62 {
  shift =~ s/([ABC])\1{0,62}/ chr( length($&) << 2 | ord($1) & 3) /ger;
}

sub uncomp62 {
  shift =~ s/./ (($& & "\3") | '@') x (ord($&) >> 2) /gesr;
}

# compress by gzip

use IO::Compress::Gzip qw(gzip);

sub compgzip {
  gzip \(shift) => \(my $output);
  $output;
}

use IO::Uncompress::Gunzip qw(gunzip);

sub uncompgzip {
  gunzip \(shift) => \(my $output);
  $output;
}

-*- mode: compilation; default-directory: "d:/tmp/pm/" -*-
Compilation started at Fri Sep 10 13:25:18

C:/Strawberry/perl/bin\perl.exe -w d:/tmp/pm/pack_63_chars.pl

------------------------------ Compression by gzip/gunzip

length of data 19089
length of compressed data 1021
compressed to 5.3%
MATCH

------------------------------ Compression by 2 bit code, 6 bit runlength

length of data 19089
length of compressed data 7714
compressed to 40.4%
MATCH

------------------------------ Compression by 2 bits per letter

length of data 19089
length of compressed data 4773
compressed to 25.0%
MATCH

------------------------------ Compression by groups of 5,2,1

length of data 19089
length of compressed data 3819
compressed to 20.0%
MATCH

Compilation finished at Fri Sep 10 13:25:18

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

    °) Even after constantly reversing one part of the input, I'm at 9% compression for the factor-100 input.

      Here is a random_data() which simulates the statistical properties of the 3 lines of data baxy77bax provided:

      5' EDIT on the script below, nothing important

# by bliako on 10-09-2021
# for https://perlmonks.org/?node_id=11136582
# and specifically https://perlmonks.org/?node_id=11136632

use strict;
use warnings;
use Data::Dumper;

# easily reproduce results or comment it out
srand 1234;

# how many lines of new data to produce
my $lines_to_produce = 1000;

# this is the actual data we want to discover
# and simulate its statistical properties:
my @actual_data = qw/
ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
/;

my $random_data = random_data(\@actual_data, $lines_to_produce);

# and calculate and print statistics on the data just produced
my ($xdist1, $xdist2) = CumProbDists($random_data);

## end

sub random_data {
    my $actual_data = shift;
    my $lines = shift;

    # this calculates the mono-gram and di-gram
    # prob.dist of the actual data.
    # mono-gram P(A), di-gram P(A|B)
    my ($dist1, $dist2) = CumProbDists($actual_data);

    my $width = length $actual_data->[0];

    my @results;
    while( $lines-- ){
        my @line;
        my $letter = which_letter($dist1);
        push @line, $letter;
        for (2..$width){
            $letter = which_letter($dist2->{$letter});
            push @line, $letter;
        }
        push @results, join('', @line);
    }
    return \@results;
}

#### end main ####

# given some data in the form of an arrayref of strings, it will
# calculate cumulative probability distribution (1st frequency, then p.d.
# and then cumulative p.d.)
sub CumProbDists {
    # make a copy because it destructs $data
    my $data = [ @{$_[0]} ];

    # the results:
    my %dist1; # cpd for each letter A, B, C
    my %dist2; # cpd for each digram, e.g. A->A, A->B, C->A etc.

    for my $aline (@$data){
        ###################################
        # I hope this somewhat obscene regex
        # does not violate any CoCs
        ###################################
        while( $aline =~ s/^(.)(.)/$2/g ){
            $dist1{$1}++;
            $dist2{$1}->{$2}++;
        }
    }
    print "Frequencies:\n";
    print Dumper(\%dist1);
    print Dumper(\%dist2);

    # convert to prob.dist.
    my $sum = 0;
    $sum += $_ for values %dist1;
    $_ /= $sum for values %dist1;
    for my $v1 (keys %dist1){
        $sum = 0;
        $sum += $_ for values %{$dist2{$v1}};
        $_ /= $sum for values %{$dist2{$v1}};
    }
    print "Probability Distribution:\n";
    print Dumper(\%dist1);
    print Dumper(\%dist2);

    # convert to cumulative prob.dist.
    $sum = 0;
    for (sort keys %dist1){
        $dist1{$_} += $sum;
        $sum = $dist1{$_};
    }
    for my $v1 (keys %dist1){
        $sum = 0;
        for (sort keys %{$dist2{$v1}}){
            $dist2{$v1}->{$_} += $sum;
            $sum = $dist2{$v1}->{$_};
        }
    }
    print "Cumulative Probability distribution:\n";
    print Dumper(\%dist1);
    print Dumper(\%dist2);

    return (\%dist1, \%dist2)
}

# given a cum-prob-dist (as a hashref where key is the letter to choose,
# and value is the cum-prob-dist)
# it will return a letter randomly but satisfying the distribution
sub which_letter {
    my $dist = shift;
    my $rand = rand;
    for (sort keys %$dist){
        if( $rand <= $dist->{$_} ){ return $_ }
    }
}

      bw, bliako

        Well thanks, you are free to test it against tybalt's code :)

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery

      And now I faked a totally random distribution of the alphabet, in 10000 lines as required in the OP.

      tybalt89's 5,2,1 wins with a constant 20%, close to the optimum, but zip is second best with a spectacular 23-24%.

      That's the worst case for zip: it can't find any repeated patterns or an uneven distribution to exploit.

      But it still compresses within close range of the optimum.
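      A quick back-of-the-envelope check of those numbers (a sketch, assuming the letters are uniform and independent; the gzip figure is taken from the run shown below):

use strict;
use warnings;

my $optimum = log(3) / log(2) / 8;   # entropy bound: ~19.8% of the original size
printf "theoretical optimum : %.1f%%\n", 100 * $optimum;
printf "groups of 5,2,1     : %.1f%%\n", 100 / 5;   # constant 20%
printf "gzip (measured)     : %.1f%%\n", 23.8;      # from the output below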

use strict;    # https://perlmonks.org/?node_id=11136582
use warnings;  # originally https://perlmonks.org/?displaytype=displaycode;node_id=11136614;abspart=1;part=1
use feature 'say';

my $factor = 3333;

my $data = <<END;
ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
END

my @data = split /\n/, $data;
# push @data, map { rotate($data[-$_]) } reverse 1..3 for 1 .. $factor;
push @data, map { random_data() } reverse 1..3 for 1 .. $factor;
$data = join "", @data;   # remove \n, they can be re-inserted later

for my $try (
  [ 'gzip/gunzip',                 \&compgzip,  \&uncompgzip  ],
  [ '2 bit code, 6 bit runlength', \&comp62,    \&uncomp62    ],
  [ '2 bits per letter',           \&comp2bits, \&uncomp2bits ],
  [ 'groups of 5,2,1',             \&comp5,     \&uncomp5     ],
  )
{
  my ($method, $comp, $uncomp) = @$try;
  print "\n------------------------------ Compression by $method\n\n";

  print "length of data @{[ $data =~ tr/ABC// ]}\n";
  my $compressed = $comp->($data);
  print "length of compressed data @{[ length $compressed ]}\n";
  #use Data::Dump 'dd'; dd $compressed;
  # print unpack('H*', $compressed), "\n";
  my $uncompressed = $uncomp->($compressed);
  printf "compressed to %.1f%%\n",
    100 * length($compressed) / length $uncompressed;
  print $data eq $uncompressed ? "MATCH" : "************ no MATCH", "\n";
}

# ---- by lanx

sub rotate {   # fake data from original sample
  my $dbg = 0;
  # say "--- O: ",
  my $orig = shift;
  # say
  my $rnd  = int rand 63;
  my $head = reverse substr $orig, 0, $rnd, "";
  # say "N: ", $orig . "|" . $head;
  my $new  = $orig . $head;
  return $new;
}

sub random_data {
  #say
  my $new = join "", map { chr int(rand 3) + ord "A" } 1..63;
  return $new;
}

# compress by groups of 5,2,1 to single letter

sub comp5 {
  my @code = map glob('{A,B,C}' x $_), 5, 2, 1;
  my %code;
  @code{@code} = map chr, 1 .. @code;
  local $" = '|';
  shift =~ s/(@code)/$code{$1}/gr
}

sub uncomp5 {
  my @code = map glob('{A,B,C}' x $_), 5, 2, 1;
  my %code;
  @code{map chr, 1 .. @code} = @code;
  join '', @code{split //, shift};
}

# compress by lower two bits of letter

sub comp2bits {
  my ($ans, $n) = ('', 0);
  vec($ans, $n++, 2) = 3 & ord $_ for split //, shift;
  $ans;
}

sub uncomp2bits {
  my $comp = shift;
  join '', map { ('', 'A', 'B', 'C')[ vec $comp, $_, 2 ] } 0 .. -1 + 4 * length $comp;
}

# compress by runlength: 6 bits length and 2 bits letter code

sub comp62 {
  shift =~ s/([ABC])\1{0,62}/ chr( length($&) << 2 | ord($1) & 3) /ger;
}

sub uncomp62 {
  shift =~ s/./ (($& & "\3") | '@') x (ord($&) >> 2) /gesr;
}

# compress by gzip

use IO::Compress::Gzip qw(gzip);

sub compgzip {
  gzip \(shift) => \(my $output);
  $output;
}

use IO::Uncompress::Gunzip qw(gunzip);

sub uncompgzip {
  gunzip \(shift) => \(my $output);
  $output;
}

-*- mode: compilation; default-directory: "d:/tmp/pm/" -*-
Compilation started at Fri Sep 10 15:00:47

C:/Strawberry/perl/bin\perl.exe -w d:/tmp/pm/pack_63_chars.pl

------------------------------ Compression by gzip/gunzip

length of data 630126
length of compressed data 149769
compressed to 23.8%
MATCH

------------------------------ Compression by 2 bit code, 6 bit runlength

length of data 630126
length of compressed data 420151
compressed to 66.7%
MATCH

------------------------------ Compression by 2 bits per letter

length of data 630126
length of compressed data 157532
compressed to 25.0%
MATCH

------------------------------ Compression by groups of 5,2,1

length of data 630126
length of compressed data 126026
compressed to 20.0%
MATCH

Compilation finished at Fri Sep 10 15:00:58

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        And tybalt89 does the "I beat gzip" happy dance :)

        (Or: sometimes special problem knowledge can beat a generalized algorithm.)