in reply to Re: n-dimensional statistical analysis of DNA sequences (or text, or ...)
in thread n-dimensional statistical analysis of DNA sequences (or text, or ...)

Regarding downloading the DNA sequence file try pasting the url on your browser or give curl a try. For the record, wget works in my system, maybe because of some wgetrc setting like passive ftp ON or user-agent string.

pryrt answered your 2nd query! Basically set your lib (by use of use lib) wherever the module file is placed. If you want to replicate my system: mkdir Markov; mv Ndimensional.pm Markov mkdir -p lib/Markov; mv Ndimensional.pm lib/Markov should do it.

Btw, this is where DNA sequences from a lot of species including homosapiens (HS) are located: ftp.ncbi.nih.gov/genome. If you can spare the CPU you can do statistical analyses not only between different chromosomes of HS but also between HS and other species. What's needed now is to take that frequency counts and compare them or just plot.

Btw2, in addition to ATCG bases (canonical) one gets a few more, notably "N". Check Re^10: Reduce RAM required for more. I don't know more on these.

Hope you get it running soon

Replies are listed 'Best First'.
Re^3: n-dimensional statistical analysis of DNA sequences (or text, or ...)
by Aldebaran (Curate) on Jan 23, 2019 at 23:09 UTC
    Hope you get it running soon.

    I've been running this software to try to understand and must remark at how impressed I am, now with results to show. I'll try to show the output without breaking the reader's scroll finger.

    The first program analyzes the dna at that zipped download. Then you gotta unzip it:

     gunzip -k hs_ref_GRCh38.p12_chr20.fa.gz

    Even between readmore tags, I'll edit it to highlight points.

    What gives with the clumpiness of the N's? They look like the duct tape that holds everything else together.

    It was also helpful for me to run it again and compare output:

    $ diff 1.bliako.txt 2.bliako.txt >3.bliako.txt

    Can you explain why these might be different one run to the next:

    285c285 < NT => ["A", 0.15, "C", 0.5, "G", 0.8, "T", 1], --- > NT => ["C", 0.35, "G", 0.65, "T", 0.85, "A", 1],

    Downloading shelley was easy.

    $ ./1.predict.pl --input-state 84.state ./1.predict.pl : read state from '84.state', ngram-length is 2. ./1.predict.pl : starting with seed 'futile' ... futile were placed at once turned with anxious suspense I threw me whe +n she might be guilty are soul have wandered with Project Gutenberg t +m works based ./1.predict.pl : done. $
    ./1.predict.pl : starting with seed 'reasoning' ... reasoning I paid a drunken ./1.predict.pl : done. $

    The .state file was truly amazing:

    friendship => ["and", 0.5, "You", 1], frightened => ["as", 0.5, "me", 1], frightful => [ "darkness", 0.111111111111111, "that", 0.222222222222222, "catalogue", 0.333333333333333, "selfishness", 0.444444444444444, "an", 0.555555555555556, "I", 0.666666666666667, "fiend", 0.777777777777778, "the", 0.888888888888889, "dreams", 1, ],

    But wait, is this software that makes a frankensentence? (I heard that it is one of the gifts that the magi brought to the child in nativity story.)

    This is incredible:

    $ ./1.predict.pl --input-state 84.state ./1.predict.pl : read state from '84.state', ngram-length is 2. ./1.predict.pl : starting with seed 'futile' ... futile were placed at once turned with anxious suspense I threw me whe +n she might be guilty are soul have wandered with Project Gutenberg t +m works based ./1.predict.pl : done. $ ./1.predict.pl --input-state 84.state ./1.predict.pl : read state from '84.state', ngram-length is 2. ./1.predict.pl : starting with seed 'reasoning' ... reasoning I paid a drunken ./1.predict.pl : done.

    I hope I didn't beat it up too much trying to replicate it....

      Can you explain why these might be different one run to the next: 285c285 < NT => ["A", 0.15, "C", 0.5, "G", 0.8, "T", 1], --- > NT => ["C", 0.35, "G", 0.65, "T", 0.85, "A", 1],

      Yes, this is cumulative probability. In order to get the actual probability subtract the one you are interested from its previous if it has any. For example, P(A|NT) = 0.15 and also P(A|NT) = 1 - 0.85=0.15 . So, P(A|NT)=0.15, P(C|NT)=0.35, P(G|NT)=0.3, P(T|NT)=0.2

      Because of the keys %hash always returning a random order of the keys, I could not avoid getting different representation of the same output without using an expensive sort. Likewise for not wasting memory by also calculating a "proper" probabilities in the way I have just shown you.

      Btw, that last  "N" => 3 is just your ngram-length.

      Well, it seems you have created a frankestein. it will end in tears ... brrrrrrrr.