Hope you get it running soon.

I've been running this software to try to understand it, and I must say how impressed I am, now that I have results to show. I'll try to show the output without breaking the reader's scroll finger.

The first program analyzes the DNA in that gzipped download, which you have to unzip first:

 gunzip -k hs_ref_GRCh38.p12_chr20.fa.gz

Even though the output sits between readmore tags, I've trimmed it to highlight the main points.

{
  "counts" => {
    AAA => 2104860, AAC => 887983, AAG => 1197774, AAN => 4, AAT => 1303394,
    ACA => 1224148, ACC => 766646, ACG => 188734, ACN => 3, ACT => 979085,
    AGA => 1372793, AGC => 973647, AGG => 1231994, AGN => 6, AGT => 974059,
    ANA => 1, ANN => 21,
    ATA => 1032519, ATC => 796743, ATG => 1102255, ATN => 6,
    ...
  "cum-twisted-dist" => {
    AA => [ "A", 0.383118721008224, "T", 0.620357607323606, "N", 0.6203583353886, "C", 0.781985669860749, "G", 1, ],
    AC => [ "A", 0.387558348339906, "C", 0.630274145385194, "G", 0.690026264667816, "T", 0.99999905021693, "N", 1, ],
    AG => [ "A", 0.301547128291516, "C", 0.515418015467988, "T", 0.729379402389764, "N", 0.72938072034722, "G", 1, ],
    AN => ["N", 0.954545454545455, "A", 1],
    ...
    NN => [ "A", 0.000263451851539613, "T", 0.000442599110586549, "C", 0.000547979851202394, "N", 0.999915695407507, "G", 1, ],
    ...
  "dist" => {
    AAA => 0.0335770294488858, AAC => 0.0141652325290566, AAG => 0.0191070631163639,
    AAN => 6.3808575295052e-08, AAT => 0.0207919285470298,
    ...
    NAA => 6.3808575295052e-08, NAC => 3.1904287647526e-08, NAG => 2.71186445003971e-07, NAT => 6.3808575295052e-08,
    NCA => 3.1904287647526e-08, NCC => 3.1904287647526e-08, NCT => 7.97607191188151e-08,
    NGA => 6.3808575295052e-08, NGC => 1.5952143823763e-08, NGG => 3.1904287647526e-08, NGT => 7.97607191188151e-08,
    NNA => 3.98803595594075e-07, NNC => 1.5952143823763e-07, NNG => 1.27617150590104e-07, NNN => 0.00151280560738274, NNT => 2.71186445003971e-07,
    NTA => 4.7856431471289e-08, NTC => 1.11665006766341e-07, NTG => 9.57128629425781e-08, NTT => 6.3808575295052e-08,
    ...
    TTN => 9.57128629425781e-08, TTT => 0.035068794178565,
  },
  "N" => 3,
}
./1.bliako.pl : done.
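
For anyone trying to relate the "counts" and "dist" sections of the dump, here is a minimal sketch in Python (not the original Perl). It assumes the counts come from overlapping windows of length N slid one character at a time and that "dist" is simply each count divided by the total; I have not checked this against the actual program. The names ngram_counts and to_dist and the toy sequence are made up for illustration.

```python
from collections import Counter

def ngram_counts(sequence, n=3):
    """Count every overlapping window of length n in the sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))

def to_dist(counts):
    """Normalize raw counts so that they sum to 1."""
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

# Hypothetical toy sequence, not real chromosome data.
toy = "AAACAAAN"
counts = ngram_counts(toy, n=3)
dist = to_dist(counts)

for gram in sorted(counts):
    print(gram, counts[gram], dist[gram])
```

The dump looks consistent with this kind of normalization: AAN has a count of 4 and a dist of 6.3808575295052e-08, which implies a total of roughly 62.7 million trigrams.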

What gives with the clumpiness of the N's? They look like the duct tape that holds everything else together.

It was also helpful for me to run it again and compare output:

$ diff 1.bliako.txt 2.bliako.txt >3.bliako.txt

Can you explain why these might be different one run to the next:

285c285
< NT => ["A", 0.15, "C", 0.5, "G", 0.8, "T", 1],
---
> NT => ["C", 0.35, "G", 0.65, "T", 0.85, "A", 1],

Downloading Shelley was easy.

The .state file was truly amazing:

friendship => ["and", 0.5, "You", 1],
frightened => ["as", 0.5, "me", 1],
frightful => [
  "darkness", 0.111111111111111,
  "that", 0.222222222222222,
  "catalogue", 0.333333333333333,
  "selfishness", 0.444444444444444,
  "an", 0.555555555555556,
  "I", 0.666666666666667,
  "fiend", 0.777777777777778,
  "the", 0.888888888888889,
  "dreams", 1,
],
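
To make sense of those entries, here is a small Python sketch (not the original Perl) of one way a list of alternating outcomes and cumulative probabilities can be used to pick the next word: draw a uniform random number and take the first outcome whose cumulative value reaches it. That the predictor works this way is my assumption; pick and sample are hypothetical helper names.

```python
import bisect
import random

# Entry copied from the state dump: alternating outcome, cumulative probability.
entry = ["and", 0.5, "You", 1]

def pick(entry, r):
    """Return the first outcome whose cumulative probability is >= r."""
    outcomes = entry[0::2]
    cumulative = entry[1::2]
    index = bisect.bisect_left(cumulative, r)
    # Guard against rounding leaving the last cumulative value just below r.
    return outcomes[min(index, len(outcomes) - 1)]

def sample(entry, rng=random):
    """Pick an outcome at random according to the cumulative entry."""
    return pick(entry, rng.random())

print(sample(entry))
```

Storing the running totals means a single random draw and a binary search are enough per word, with no need to rebuild the distribution on each step.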

But wait, is this software that makes a frankensentence? (I heard that it is one of the gifts that the magi brought to the child in the nativity story.)

This is incredible:

$ ./1.predict.pl --input-state 84.state
./1.predict.pl : read state from '84.state', ngram-length is 2.
./1.predict.pl : starting with seed 'futile' ...
futile were placed at once turned with anxious suspense I threw me when she might be guilty are soul have wandered with Project Gutenberg tm works based
./1.predict.pl : done.
$ ./1.predict.pl --input-state 84.state
./1.predict.pl : read state from '84.state', ngram-length is 2.
./1.predict.pl : starting with seed 'reasoning' ...
reasoning I paid a drunken
./1.predict.pl : done.

I hope I didn't beat it up too much trying to replicate it....

Re^4: n-dimensional statistical analysis of DNA sequences (or text, or ...)
by bliako (Abbot) on Jan 23, 2019 at 23:32 UTC
    Can you explain why these might be different one run to the next:

    285c285
    < NT => ["A", 0.15, "C", 0.5, "G", 0.8, "T", 1],
    ---
    > NT => ["C", 0.35, "G", 0.65, "T", 0.85, "A", 1],

    Yes, this is cumulative probability. To get the actual probability of an outcome, subtract the previous cumulative value (if there is one) from its own. For example, the first run gives P(A|NT) = 0.15 directly, and the second run gives P(A|NT) = 1 - 0.85 = 0.15. Either way, P(A|NT) = 0.15, P(C|NT) = 0.35, P(G|NT) = 0.3, P(T|NT) = 0.2.
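
    The subtraction described above can be sketched in a few lines of Python (an illustration only, not code from the actual program; to_probabilities is a made-up name). It takes the two NT lines from the diff and shows that they describe the same distribution.

```python
def to_probabilities(entry):
    """Convert ["A", 0.15, "C", 0.5, ...] into per-outcome probabilities."""
    probs = {}
    previous = 0.0
    for outcome, cumulative in zip(entry[0::2], entry[1::2]):
        # Each probability is the step up from the previous running total.
        probs[outcome] = cumulative - previous
        previous = cumulative
    return probs

# The two NT entries from the diff output.
run1 = ["A", 0.15, "C", 0.5, "G", 0.8, "T", 1]
run2 = ["C", 0.35, "G", 0.65, "T", 0.85, "A", 1]

p1 = to_probabilities(run1)
p2 = to_probabilities(run2)
for base in "ACGT":
    print(base, round(p1[base], 10), round(p2[base], 10))
```

    The values are rounded when printed because floating-point subtraction can leave tiny differences in the last digits.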

    Because keys %hash returns the keys in an effectively random order that can change from run to run, I could not avoid getting different representations of the same output without using an expensive sort. Likewise, to avoid wasting memory, I do not also store the "proper" probabilities calculated in the way I have just shown you.
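
    If you want output that diffs cleanly between runs, one option is to normalize each entry after the fact. The Python sketch below (not part of the actual program; canonical is a hypothetical helper) recovers the individual probabilities, sorts the outcomes, and rebuilds the running totals. The rounding to 12 decimal places is my own choice, there only to hide floating-point noise.

```python
def canonical(entry):
    """Rebuild a cumulative entry with its outcomes in sorted order."""
    # Recover individual probabilities by differencing the running totals.
    probs = {}
    previous = 0.0
    for outcome, cumulative in zip(entry[0::2], entry[1::2]):
        probs[outcome] = cumulative - previous
        previous = cumulative

    # Accumulate again, this time in a fixed (sorted) order.
    result = []
    running = 0.0
    for outcome in sorted(probs):
        running += probs[outcome]
        result += [outcome, round(running, 12)]
    return result

run1 = ["A", 0.15, "C", 0.5, "G", 0.8, "T", 1]
run2 = ["C", 0.35, "G", 0.65, "T", 0.85, "A", 1]
print(canonical(run1))
print(canonical(run2))
```

    This costs one sort per entry, so it is probably better done as a post-processing step on the dumped file than inside the main counting loop.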

    Btw, that last  "N" => 3 is just your ngram-length.

    Well, it seems you have created a Frankenstein. It will end in tears ... brrrrrrrr.