Re^4: a random_data() implementation

Here is a random_data() that simulates the statistical properties (mono-gram and di-gram frequencies) of the 3 lines of data baxy77bax provided:

Edit (5 minutes later): a small change to the script below, nothing important.

# by bliako on 10-09-2021
# for https://perlmonks.org/?node_id=11136582
# and specifically https://perlmonks.org/?node_id=11136632
use strict;
use warnings;

use Data::Dumper;

# easily reproduce results or comment it out
srand 1234;

# how many lines of new data to produce
my $lines_to_produce = 1000;

# this is the actual data we want to discover
# and simulate its statistical properties:
my @actual_data = qw/
    ABBCBCAAAAABBCBCACCCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCAABC
    BCCCBCAACAABBBCAAACCAAAAACAAAAABBBBBAAAAABBAAAAAAAABBCCCACCABBC
    ABCCBBBAAAABBABCACABCCCCCCAAAAABBCBBCCCCAAAAAAAAAAAAACCCACCACCC
/;

my $random_data = random_data(\@actual_data, $lines_to_produce);

# and calculate and print statistics on the data just produced
my ($xdist1, $xdist2) = CumProbDists($random_data);

## end

sub random_data {
    my $actual_data = shift;
    my $lines = shift;

    # this calculates the mono-gram and di-gram
    # prob.dist of the actual data.
    # mono-gram P(A), di-gram P(A|B)
    my ($dist1, $dist2) = CumProbDists($actual_data);

    my $width = length $actual_data->[0];
    my @results;
    while( $lines-- ){
        my @line;
        my $letter = which_letter($dist1);
        push @line, $letter;
        for (2..$width){
            $letter = which_letter($dist2->{$letter});
            push @line, $letter;
        }
        push @results, join('', @line);
    }
    return \@results;
}
#### end main ####

# given some data in the form of an arrayref of strings, it will
# calculate cumulative probability distribution (1st frequency, then p.d.
# and then cumulative p.d.)
sub CumProbDists {
    # make a copy because it destructs $data
    my $data = [ @{$_[0]} ];
    # the results:
    my %dist1; # cpd for each letter A, B, C
    my %dist2; # cpd for each digram, e.g. A->A, A->B, C->A etc.
    for my $aline (@$data){
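        ###################################
        # I hope this somewhat obscene regex
        # does not violate any CoCs
        ###################################
        # each pass drops the first letter and keeps the second,
        # so every adjacent pair of letters is counted exactly once
        # (the last letter of a line is never counted in %dist1)
        while( $aline =~ s/^(.)(.)/$2/g ){
            $dist1{$1}++;
            $dist2{$1}->{$2}++;
        }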
    }
    print "Frequencies:\n";
    print Dumper(\%dist1);
    print Dumper(\%dist2);

    # convert to prob.dist.
    my $sum = 0;
    $sum += $_ for values %dist1;
    $_ /= $sum for values %dist1;
    for my $v1 (keys %dist1){
        $sum = 0;
        $sum += $_ for values %{$dist2{$v1}};
        $_ /= $sum for values %{$dist2{$v1}};
    }
    print "Probability Distribution:\n";
    print Dumper(\%dist1);
    print Dumper(\%dist2);

    # convert to cumulative prob.dist.
    $sum = 0;
    for (sort keys %dist1){
        $dist1{$_} += $sum;
        $sum = $dist1{$_};
    }
    for my $v1 (keys %dist1){
        $sum = 0;
        for (sort keys %{$dist2{$v1}}){
            $dist2{$v1}->{$_} += $sum;
            $sum = $dist2{$v1}->{$_};
        }
    }
    print "Cumulative Probability distribution:\n";
    print Dumper(\%dist1);
    print Dumper(\%dist2);

    return (\%dist1, \%dist2)
}
# given a cum-prob-dist (as a hashref where key is the letter to choose,
# and value is the cum-prob-dist)
# it will return a letter randomly but satisfying the distribution
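sub which_letter {
    my $dist = shift;
    my $rand = rand;
    my @letters = sort keys %$dist;
    for (@letters){
        if( $rand <= $dist->{$_} ){ return $_ }
    }
    # guard against rounding: the last cumulative value may end up
    # slightly below 1, so fall back to the last letter
    return $letters[-1];
}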
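
As a small illustration of how the cumulative lookup picks a letter, here is a minimal sketch. It assumes the rounded mono-gram cumulative values reported later in this thread, and `letter_for()` is a hypothetical helper that uses the same loop as which_letter() but takes the random number as an argument so the result can be checked by hand.

```perl
use strict;
use warnings;

# Cumulative distribution, rounded from the mono-gram output in this thread:
# A covers [0, 0.5], B covers (0.5, 0.7366], C covers the rest up to 1.
my %cum = ( A => 0.5, B => 0.7366, C => 1 );

# Hypothetical helper: same loop as which_letter(), but the random
# number is passed in instead of being drawn with rand.
sub letter_for {
    my ($dist, $rand) = @_;
    for my $letter (sort keys %$dist) {
        return $letter if $rand <= $dist->{$letter};
    }
    return;    # no cumulative value was large enough
}

print letter_for(\%cum, 0.10), "\n";   # A
print letter_for(\%cum, 0.60), "\n";   # B
print letter_for(\%cum, 0.90), "\n";   # C
```

The lookup only works because the cumulative sums were built in the same sorted key order that the loop walks through.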

bw, bliako

Re^5: a random_data() implementation by LanX (Saint) on Sep 10, 2021 at 15:41 UTC
Well thanks, you are free to test it against tybalt's code :)

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery
Re^6: a random_data() implementation by bliako (Abbot) on Sep 10, 2021 at 17:32 UTC
Now I found some time. Here is the testing script, incorporating my `random_data()` into the original test script that LanX and tybalt89 posted:

Read more... (7 kB)

Here is the statistical distribution of the 3-line data provided initially:

    Frequencies:
    $VAR1 = {
              'B' => 44,
              'A' => 93,
              'C' => 49
            };
    $VAR1 = {
              'C' => {
                       'C' => 25,
                       'B' => 5,
                       'A' => 19
                     },
              'A' => {
                       'C' => 11,
                       'A' => 66,
                       'B' => 16
                     },
              'B' => {
                       'C' => 16,
                       'A' => 6,
                       'B' => 22
                     }
            };
    Probability Distribution:
    $VAR1 = {
              'B' => '0.236559139784946',
              'A' => '0.5',
              'C' => '0.263440860215054'
            };
    $VAR1 = {
              'C' => {
                       'C' => '0.510204081632653',
                       'B' => '0.102040816326531',
                       'A' => '0.387755102040816'
                     },
              'A' => {
                       'C' => '0.118279569892473',
                       'A' => '0.709677419354839',
                       'B' => '0.172043010752688'
                     },
              'B' => {
                       'C' => '0.363636363636364',
                       'A' => '0.136363636363636',
                       'B' => '0.5'
                     }
            };
    Cumulative Probability distribution:
    $VAR1 = {
              'B' => '0.736559139784946',
              'A' => '0.5',
              'C' => '1'
            };
    $VAR1 = {
              'C' => {
                       'C' => '1',
                       'B' => '0.489795918367347',
                       'A' => '0.387755102040816'
                     },
              'A' => {
                       'C' => '1',
                       'A' => '0.709677419354839',
                       'B' => '0.881720430107527'
                     },
              'B' => {
                       'C' => '1',
                       'A' => '0.136363636363636',
                       'B' => '0.636363636363636'
                     }
            };

And here are the compression comparisons:

    ------------------------------ Compression by gzip/gunzip
    length of data 210168
    length of compressed data 45076
    compressed to 21.4%
    MATCH
    ------------------------------ Compression by 2 bit code, 6 bit runlength
    length of data 210168
    length of compressed data 83690
    compressed to 39.8%
    MATCH
    ------------------------------ Compression by 2 bits per letter
    length of data 210168
    length of compressed data 52542
    compressed to 25.0%
    MATCH
    ------------------------------ Compression by groups of 5,2,1
    length of data 210168
    length of compressed data 42035
    compressed to 20.0%
    MATCH

bw, bliako
Re^7: a random_data() implementation by LanX (Saint) on Sep 10, 2021 at 19:23 UTC
Did you also reproduce the chunks with the same character? Or at least di- and trigrams? zip does run-length encoding.

Cheers Rolf
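
One way to check the question about chunks of the same character is to compare run-length histograms of the original and the generated lines. The following is only a sketch under that assumption; `run_lengths()` is a hypothetical helper and not part of any script posted in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each run length occurs per letter.
# For example "AAABBC" has one A-run of 3, one B-run of 2, one C-run of 1.
sub run_lengths {
    my @lines = @_;
    my %hist;    # $hist{letter}{run length} = number of such runs
    for my $line (@lines) {
        # ((.)\2*) matches one character followed by all its repeats
        while ($line =~ /((.)\2*)/g) {
            $hist{$2}{ length $1 }++;
        }
    }
    return \%hist;
}

my $hist = run_lengths('AAABBCAAA');
for my $letter (sort keys %$hist) {
    for my $len (sort { $a <=> $b } keys %{ $hist->{$letter} }) {
        print "$letter x $len: $hist->{$letter}{$len}\n";
    }
}
```

Running the same helper on the 3 original lines and on the output of random_data() would show whether long runs such as AAAAA occur about as often in both.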
Re^8: a random_data() implementation by bliako (Abbot) on Sep 10, 2021 at 20:33 UTC
Re^9: a random_data() implementation by LanX (Saint) on Sep 11, 2021 at 14:31 UTC
Re^9: a random_data() implementation by LanX (Saint) on Sep 10, 2021 at 20:51 UTC