
Here is a random_data() which simulates the statistical properties of the 3 lines of data baxy77bax provided:

EDIT (5 minutes later): a minor change to the script below, nothing important.

# by bliako on 10-09-2021
# for https://perlmonks.org/?node_id=11136582
# and specifically https://perlmonks.org/?node_id=11136632

use strict;
use warnings;
use Data::Dumper;

# easily reproduce results or comment it out
srand 1234;

# how many lines of new data to produce
my $lines_to_produce = 1000;

# this is the actual data we want to discover
# and simulate its statistical properties:
my @actual_data = qw/
ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
/;

my $random_data = random_data(\@actual_data, $lines_to_produce);

# and calculate and print statistics on the data just produced
my ($xdist1, $xdist2) = CumProbDists($random_data);
## end

sub random_data {
    my $actual_data = shift;
    my $lines = shift;

    # this calculates the mono-gram and di-gram
    # prob.dist of the actual data.
    # mono-gram P(A), di-gram P(A|B)
    my ($dist1, $dist2) = CumProbDists($actual_data);

    my $width = length $actual_data->[0];
    my @results;
    while( $lines-- ){
        my @line;
        # the first letter follows the mono-gram distribution
        my $letter = which_letter($dist1);
        push @line, $letter;
        # each later letter is conditioned on the previous one
        for (2..$width){
            $letter = which_letter($dist2->{$letter});
            push @line, $letter;
        }
        push @results, join('', @line);
    }
    return \@results;
}
#### end main ####

# given some data in the form of an arrayref of strings, it will
# calculate cumulative probability distribution (1st frequency, then p.d.
# and then cumulative p.d.)
sub CumProbDists {
    # make a copy because it destructs $data
    my $data = [ @{$_[0]} ];

    # the results:
    my %dist1; # cpd for each letter A, B, C
    my %dist2; # cpd for each digram, e.g. A->A, A->B, C->A etc.

    for my $aline (@$data){
        ###################################
        # I hope this somewhat obscene regex
        # does not violate any CoCs
        ###################################
        while( $aline =~ s/^(.)(.)/$2/g ){
            $dist1{$1}++;
            $dist2{$1}->{$2}++;
        }
    }
    print "Frequencies:\n";
    print Dumper(\%dist1);
    print Dumper(\%dist2);

    # convert to prob.dist.
    my $sum = 0;
    $sum += $_ for values %dist1;
    $_ /= $sum for values %dist1;
    for my $v1 (keys %dist1){
        $sum = 0;
        $sum += $_ for values %{$dist2{$v1}};
        $_ /= $sum for values %{$dist2{$v1}};
    }
    print "Probability Distribution:\n";
    print Dumper(\%dist1);
    print Dumper(\%dist2);

    # convert to cumulative prob.dist.
    $sum = 0;
    for (sort keys %dist1){
        $dist1{$_} += $sum;
        $sum = $dist1{$_};
    }
    for my $v1 (keys %dist1){
        $sum = 0;
        for (sort keys %{$dist2{$v1}}){
            $dist2{$v1}->{$_} += $sum;
            $sum = $dist2{$v1}->{$_};
        }
    }
    print "Cumulative Probability distribution:\n";
    print Dumper(\%dist1);
    print Dumper(\%dist2);

    return (\%dist1, \%dist2)
}

# given a cum-prob-dist (as a hashref where key is the letter to choose,
# and value is the cum-prob-dist)
# it will return a letter randomly but satisfying the distribution
sub which_letter {
    my $dist = shift;
    my $rand = rand;
    for (sort keys %$dist){
        if( $rand <= $dist->{$_} ){ return $_ }
    }
}
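One possible edge case in which_letter(): if floating-point rounding leaves the last cumulative value a hair below 1, a large rand could fall past every entry and the sub would return undef. The sketch below is a guarded variant under that assumption. The name pick_letter and its optional second argument (a fixed random value, handy for testing) are hypothetical and not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical guarded variant of which_letter(): the same lookup in a
# cumulative distribution, plus a fallback so it never returns undef.
sub pick_letter {
    my ($dist, $rand) = @_;
    $rand = rand unless defined $rand;   # optional fixed value, for testing

    # same order as CumProbDists() used when accumulating: sorted keys
    my @letters = sort keys %$dist;
    for my $letter (@letters) {
        return $letter if $rand <= $dist->{$letter};
    }
    # only reached if rounding left the last cumulative value below $rand
    return $letters[-1];
}

# usage: cumulative values taken from the mono-gram distribution
my %cum = (A => 0.5, B => 0.736559139784946, C => 1);
print pick_letter(\%cum), "\n" for 1 .. 5;
```

It relies on the cumulative values having been accumulated in sorted-key order, which is what CumProbDists() does.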

bw, bliako

Re^5: a random_data() implementation
by LanX (Saint) on Sep 10, 2021 at 15:41 UTC
    Well thanks, you are free to test it against tybalt's code :)

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      Now I have found some time.

      Here is the testing script, which incorporates my random_data() into the original test script that LanX and tybalt89 posted:

      Here is the statistical distribution of the 3-line data provided initially:

      Frequencies:
      $VAR1 = {
                'B' => 44,
                'A' => 93,
                'C' => 49
              };
      $VAR1 = {
                'C' => {
                         'C' => 25,
                         'B' => 5,
                         'A' => 19
                       },
                'A' => {
                         'C' => 11,
                         'A' => 66,
                         'B' => 16
                       },
                'B' => {
                         'C' => 16,
                         'A' => 6,
                         'B' => 22
                       }
              };
      Probability Distribution:
      $VAR1 = {
                'B' => '0.236559139784946',
                'A' => '0.5',
                'C' => '0.263440860215054'
              };
      $VAR1 = {
                'C' => {
                         'C' => '0.510204081632653',
                         'B' => '0.102040816326531',
                         'A' => '0.387755102040816'
                       },
                'A' => {
                         'C' => '0.118279569892473',
                         'A' => '0.709677419354839',
                         'B' => '0.172043010752688'
                       },
                'B' => {
                         'C' => '0.363636363636364',
                         'A' => '0.136363636363636',
                         'B' => '0.5'
                       }
              };
      Cumulative Probability distribution:
      $VAR1 = {
                'B' => '0.736559139784946',
                'A' => '0.5',
                'C' => '1'
              };
      $VAR1 = {
                'C' => {
                         'C' => '1',
                         'B' => '0.489795918367347',
                         'A' => '0.387755102040816'
                       },
                'A' => {
                         'C' => '1',
                         'A' => '0.709677419354839',
                         'B' => '0.881720430107527'
                       },
                'B' => {
                         'C' => '1',
                         'A' => '0.136363636363636',
                         'B' => '0.636363636363636'
                       }
              };
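To make the step from "Frequencies" to "Cumulative Probability distribution" easier to follow, here is a minimal sketch that redoes it for the mono-gram counts only. The helper name cumulative_from_counts is hypothetical; the counts are copied from the dump above.

```perl
use strict;
use warnings;

# Hypothetical helper: turn raw counts into a cumulative distribution,
# accumulating in sorted-key order.
sub cumulative_from_counts {
    my ($counts) = @_;

    my $total = 0;
    $total += $_ for values %$counts;

    my $running = 0;
    my %cum;
    for my $letter (sort keys %$counts) {
        $running += $counts->{$letter} / $total;   # add this letter's probability
        $cum{$letter} = $running;
    }
    return \%cum;
}

# mono-gram counts from the "Frequencies" dump (they sum to 186)
my $cum = cumulative_from_counts({ A => 93, B => 44, C => 49 });
printf "%s => %.6f\n", $_, $cum->{$_} for sort keys %$cum;
```

The printed values should agree with the first cumulative dump: A at 93/186, B at (93+44)/186, and C at 1.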

      And here are the compression comparisons:

      ------------------------------ Compression by gzip/gunzip
      length of data 210168
      length of compressed data 45076
      compressed to 21.4%
      MATCH
      ------------------------------ Compression by 2 bit code, 6 bit runlength
      length of data 210168
      length of compressed data 83690
      compressed to 39.8%
      MATCH
      ------------------------------ Compression by 2 bits per letter
      length of data 210168
      length of compressed data 52542
      compressed to 25.0%
      MATCH
      ------------------------------ Compression by groups of 5,2,1
      length of data 210168
      length of compressed data 42035
      compressed to 20.0%
      MATCH
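For context on these percentages, one rough yardstick is the conditional entropy of the digram model that random_data() samples from. The sketch below estimates it from the digram counts in the "Frequencies" dump. It is only an approximation under stated assumptions: it weights each row by how often that letter appeared as the first of a pair in the 3 sample lines, it ignores line ends, and the helper name conditional_entropy is hypothetical.

```perl
use strict;
use warnings;

# Digram counts copied from the "Frequencies" dump in this post.
my %digram = (
    A => { A => 66, B => 16, C => 11 },
    B => { A => 6,  B => 22, C => 16 },
    C => { A => 19, B => 5,  C => 25 },
);

# Hypothetical helper: H(next letter | previous letter) in bits per letter.
sub conditional_entropy {
    my ($counts) = @_;

    my $grand = 0;
    for my $row (values %$counts) { $grand += $_ for values %$row; }

    my $h = 0;
    for my $prev (keys %$counts) {
        my $row_total = 0;
        $row_total += $_ for values %{ $counts->{$prev} };
        for my $n (values %{ $counts->{$prev} }) {
            next unless $n;                      # skip digrams never seen
            my $p = $n / $row_total;             # P(next | prev)
            $h -= ($row_total / $grand) * $p * log($p) / log(2);
        }
    }
    return $h;
}

my $bits = conditional_entropy(\%digram);
printf "%.3f bits per letter, %.1f%% of 8-bit input\n", $bits, 100 * $bits / 8;
```

A coder that exploits only the previous letter cannot be expected to do much better than this figure on data generated by random_data(), so it gives a sense of how much room is left below the results above.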

      bw, bliako

        Did you also reproduce the runs of identical characters?

        Or at least the di- and trigrams?

        zip does run-length encoding.
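One way to check the first question would be to compare run-length histograms of the real lines and the generated ones. This is a minimal sketch; the helper name run_length_histogram is hypothetical, and the input shown is only the first of the 3 sample lines.

```perl
use strict;
use warnings;

# Hypothetical helper: for each letter, count how many runs of each
# length occur in the given lines.
sub run_length_histogram {
    my (@lines) = @_;
    my %hist;
    for my $line (@lines) {
        # $1 is a maximal run of one character, $2 is that character
        while ($line =~ /((.)\2*)/g) {
            $hist{$2}{ length $1 }++;
        }
    }
    return \%hist;
}

# usage: first line of the original sample data
my $hist = run_length_histogram(
    'ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC',
);
for my $letter (sort keys %$hist) {
    for my $len (sort { $a <=> $b } keys %{ $hist->{$letter} }) {
        print "$letter x $len: $hist->{$letter}{$len}\n";
    }
}
```

Running the same sub over the output of random_data() and comparing the two histograms would show whether the long runs survive in the simulated data.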

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery