in reply to Re^2: n-dimensional statistical analysis of DNA sequences (or text, or ...)
in thread n-dimensional statistical analysis of DNA sequences (or text, or ...)
I've been running this software to try to understand it, and I must remark on how impressed I am, now that I have results to show. I'll try to show the output without breaking the reader's scroll finger.
The first program analyzes the DNA in that zipped download, which you have to unzip first:
$ gunzip -k hs_ref_GRCh38.p12_chr20.fa.gz
Even between readmore tags, I'll edit the output to highlight points.
{
  "counts" => {
    AAA => 2104860, AAC => 887983, AAG => 1197774, AAN => 4, AAT => 1303394,
    ACA => 1224148, ACC => 766646, ACG => 188734, ACN => 3, ACT => 979085,
    AGA => 1372793, AGC => 973647, AGG => 1231994, AGN => 6, AGT => 974059,
    ANA => 1, ANN => 21,
    ATA => 1032519, ATC => 796743, ATG => 1102255, ATN => 6,
    ...
  "cum-twisted-dist" => {
    AA => [
      "A", 0.383118721008224,
      "T", 0.620357607323606,
      "N", 0.6203583353886,
      "C", 0.781985669860749,
      "G", 1,
    ],
    AC => [
      "A", 0.387558348339906,
      "C", 0.630274145385194,
      "G", 0.690026264667816,
      "T", 0.99999905021693,
      "N", 1,
    ],
    AG => [
      "A", 0.301547128291516,
      "C", 0.515418015467988,
      "T", 0.729379402389764,
      "N", 0.72938072034722,
      "G", 1,
    ],
    AN => ["N", 0.954545454545455, "A", 1],
    ...
    NN => [
      "A", 0.000263451851539613,
      "T", 0.000442599110586549,
      "C", 0.000547979851202394,
      "N", 0.999915695407507,
      "G", 1,
    ],
    ...
  "dist" => {
    AAA => 0.0335770294488858, AAC => 0.0141652325290566, AAG => 0.0191070631163639,
    AAN => 6.3808575295052e-08, AAT => 0.0207919285470298,
    ...
    NAA => 6.3808575295052e-08, NAC => 3.1904287647526e-08,
    NAG => 2.71186445003971e-07, NAT => 6.3808575295052e-08,
    NCA => 3.1904287647526e-08, NCC => 3.1904287647526e-08, NCT => 7.97607191188151e-08,
    NGA => 6.3808575295052e-08, NGC => 1.5952143823763e-08,
    NGG => 3.1904287647526e-08, NGT => 7.97607191188151e-08,
    NNA => 3.98803595594075e-07, NNC => 1.5952143823763e-07, NNG => 1.27617150590104e-07,
    NNN => 0.00151280560738274, NNT => 2.71186445003971e-07,
    NTA => 4.7856431471289e-08, NTC => 1.11665006766341e-07,
    NTG => 9.57128629425781e-08, NTT => 6.3808575295052e-08,
    ...
    TTN => 9.57128629425781e-08, TTT => 0.035068794178565,
  },
  "N" => 3,
}
./1.bliako.pl : done.
What gives with the clumpiness of the N's? They look like the duct tape that holds everything else together.
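To convince myself what the "cum-twisted-dist" rows mean, here is a minimal sketch that rebuilds the AA row from the AA? counts in the output above. It assumes each row is a running sum of count divided by the prefix total, taken in the symbol order the row happens to list; that is my inference from the numbers, not something I have checked against the script's source. The variable names are mine.

```perl
use strict;
use warnings;

# Counts for the 3-grams that start with "AA", copied from the output above.
my %counts = (
    AAA => 2104860, AAC => 887983, AAG => 1197774,
    AAN => 4,       AAT => 1303394,
);

# Symbol order as it appears in the AA row of "cum-twisted-dist".
my @order = qw(A T N C G);

# Total number of times the prefix "AA" was followed by anything.
my $total = 0;
$total += $_ for values %counts;

# Assumption: each entry is the running sum of count/total.
my $running = 0;
my @cum;
for my $sym (@order) {
    $running += $counts{"AA$sym"} / $total;
    push @cum, $sym, $running;
}

# Print symbol/cumulative pairs for comparison with the dump.
for my $i (0 .. $#order) {
    printf "%s => %.9f\n", $cum[2 * $i], $cum[2 * $i + 1];
}
```

If the assumption holds, the printed values should line up with the AA row in the dump, with the tiny step at "N" reflecting the 4 occurrences of AAN.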
It was also helpful for me to run it again and compare output:
$ diff 1.bliako.txt 2.bliako.txt >3.bliako.txt
Can you explain why these might be different one run to the next:
285c285
< NT => ["A", 0.15, "C", 0.5, "G", 0.8, "T", 1],
---
> NT => ["C", 0.35, "G", 0.65, "T", 0.85, "A", 1],
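A quick check suggests the two rows may not disagree at all. The sketch below turns each cumulative list from the diff back into per-symbol probabilities; `decumulate` is a hypothetical helper name of mine. If the probabilities match and only the order differs, one plausible cause is Perl's randomized hash key order, assuming the script builds these lists by iterating over a hash. I have not confirmed that in the source.

```perl
use strict;
use warnings;

# The two NT rows from the diff above.
my @run1 = ("A", 0.15, "C", 0.5,  "G", 0.8,  "T", 1);
my @run2 = ("C", 0.35, "G", 0.65, "T", 0.85, "A", 1);

# Hypothetical helper: turn a cumulative list back into
# per-symbol probabilities, rounded to hide float noise.
sub decumulate {
    my @cum  = @_;
    my %prob;
    my $prev = 0;
    while (my ($sym, $c) = splice(@cum, 0, 2)) {
        $prob{$sym} = sprintf "%.4f", $c - $prev;
        $prev = $c;
    }
    return \%prob;
}

my $p1 = decumulate(@run1);
my $p2 = decumulate(@run2);

# Print both runs side by side, one symbol per line.
for my $sym (sort keys %$p1) {
    printf "%s  run1=%s  run2=%s\n", $sym, $p1->{$sym}, $p2->{$sym};
}
```

Sampling from either list gives the same distribution, so a difference like this would be cosmetic unless the random draws are also meant to be reproducible between runs.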
Downloading Shelley was easy.
$ ./1.predict.pl --input-state 84.state
./1.predict.pl : read state from '84.state', ngram-length is 2.
./1.predict.pl : starting with seed 'futile' ...
futile were placed at once turned with anxious suspense I threw me when she might be guilty are soul have wandered with Project Gutenberg tm works based
./1.predict.pl : done.
$
./1.predict.pl : starting with seed 'reasoning' ...
reasoning I paid a drunken
./1.predict.pl : done.
$
The .state file was truly amazing:
friendship => ["and", 0.5, "You", 1],
frightened => ["as", 0.5, "me", 1],
frightful => [
  "darkness", 0.111111111111111,
  "that", 0.222222222222222,
  "catalogue", 0.333333333333333,
  "selfishness", 0.444444444444444,
  "an", 0.555555555555556,
  "I", 0.666666666666667,
  "fiend", 0.777777777777778,
  "the", 0.888888888888889,
  "dreams", 1,
],
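As far as I can tell, entries like these are what lets the predictor choose the next word: draw a random number in [0, 1) and take the first word whose cumulative value exceeds it. The sketch below illustrates that idea only; `pick_next` is a hypothetical name of mine, and I am not claiming this is how 1.predict.pl actually does it.

```perl
use strict;
use warnings;

# Two entries copied from the .state excerpt above.
my %state = (
    friendship => ["and", 0.5, "You", 1],
    frightened => ["as",  0.5, "me",  1],
);

# Hypothetical helper: return the first word whose cumulative
# probability exceeds $r, where $r is a number in [0, 1).
sub pick_next {
    my ($cum, $r) = @_;
    for (my $i = 0; $i < @$cum; $i += 2) {
        return $cum->[$i] if $r < $cum->[$i + 1];
    }
    return $cum->[-2];    # fallback if rounding leaves $r uncovered
}

# One random step from the word "frightened".
my $word = "frightened";
print "$word ", pick_next($state{$word}, rand()), "\n";
```

Repeating that step with each new word as the lookup key would chain words together the way the generated sentences above seem to.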
But wait, is this software that makes a frankensentence? (I heard that it is one of the gifts that the magi brought to the child in the nativity story.)
This is incredible:
$ ./1.predict.pl --input-state 84.state
./1.predict.pl : read state from '84.state', ngram-length is 2.
./1.predict.pl : starting with seed 'futile' ...
futile were placed at once turned with anxious suspense I threw me when she might be guilty are soul have wandered with Project Gutenberg tm works based
./1.predict.pl : done.
$ ./1.predict.pl --input-state 84.state
./1.predict.pl : read state from '84.state', ngram-length is 2.
./1.predict.pl : starting with seed 'reasoning' ...
reasoning I paid a drunken
./1.predict.pl : done.
I hope I didn't beat it up too much trying to replicate it....
Re^4: n-dimensional statistical analysis of DNA sequences (or text, or ...)
by bliako (Abbot) on Jan 23, 2019 at 23:32 UTC