Instead of calculating I wrote a little script counting the probabilities of group combinations.

Roboticus was right, you need about 14,05 characters per line plus "\n"

Only 1 + 8 + 21 + 20 = 50 combinations are possible per group, resulting in a conservative compression of 50.79 bits = 6.34 bytes per line, which already means a compression to 42% = 58% win.

But if you look into the likelihood of those combinations you clearly see that a Huffman encoding would result in an even better ratio.

I think that's then near the theoretical optimum. (further reading Huffman_coding_with_unequal_letter_costs )

All this supposing your input was real... ;-)

use strict; use warnings; use Data::Dump qw/pp/; my %count; for my $c0 (0..9){ for my $c1 (0..9){ for my $c2 (0..9){ for my $c3 (0..9) { my @c = sort {$a <=>$b} ($c0,$c1,$c2,$c3); #print "@c\t:\t"; my @allowed; for my $i (1..3) { if ( $c[$i] != $c[$i-1] && $c[$i] != $c[$i-1]+1 ) +{ push @allowed, $c[$i] } } #print "@allowed\n"; $count{join "",@allowed}++ } } } } my @length; my $average; for my $k (keys %count){ my $len = length $k; $length[$len]++; $average+= $len* $count{$k}/10000; } warn '@length: ', pp \@length; warn 'average #characters: group/line',pp [$average,$average*9]; my $combies =0; $combies+= $length[$_] for 0..3; #$combies=93; warn "# possible combinations: ", $combies; my $upper_bound= log($combies)/log(2)*9; warn 'Upper bound bits, bytes', pp [$upper_bound, $upper_bound/8]; #warn "ranking", pp [ sort {$b <=>$a} values %count ]; warn 'probabilities: ',pp \%count;

@length: [1, 8, 21, 20] at /tmp/compress.pl line 36. average #characters: group/line[1.5624, 14.0616] at /tmp/compress.pl l +ine 37. # possible combinations: 50 at /tmp/compress.pl line 44. Upper bound bits, bytes[50.7947057079725, 6.34933821349656] at /tmp/co +mpress.pl line 48. probabilities: { "" => 592, "2" => 74, "24" => 60, "246" => 24, "247" => 24, "248" => 24, "249" => 24, "25" => 84, "257" => 24, "258" => 24, "259" => 24, "26" => 84, "268" => 24, "269" => 24, "27" => 84, "279" => 24, "28" => 84, "29" => 60, "3" => 208, "35" => 144, "357" => 48, "358" => 48, "359" => 48, "36" => 192, "368" => 48, "369" => 48, "37" => 192, "379" => 48, "38" => 192, "39" => 144, "4" => 366, "46" => 228, "468" => 72, "469" => 72, "47" => 300, "479" => 72, "48" => 300, "49" => 228, "5" => 524, "57" => 312, "579" => 96, "58" => 408, "59" => 312, "6" => 682, "68" => 396, "69" => 396, "7" => 840, "79" => 336, "8" => 830, "9" => 508, } at /tmp/compress.pl line 52.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

update

I just realized that Roboticus already had the same basic ideas here: Re: Data compression by 50% + : is it possible?


In reply to Re^3: Data compression by 50% + : is it possible? by LanX
in thread Data compression by 50% + : is it possible? by baxy77bax

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.