in reply to n-dimensional statistical analysis of DNA sequences (or text, or ...)

bliako, what an interesting subject matter! As of yet, I'm unable to follow you. My first problem is that I get nothing from the wget command:

$ wget ftp://ftp.ncbi.nih.gov/genomes/Homo_sapiens/CHR_20/hs_ref_GRCh3 +8.p12_chr20.fa.gz --2019-01-23 07:54:51-- ftp://ftp.ncbi.nih.gov/genomes/Homo_sapiens/C +HR_20/hs_ref_GRCh38.p12_chr20.fa.gz => ‘hs_ref_GRCh38.p12_chr20.fa.gz’ Resolving ftp.ncbi.nih.gov (ftp.ncbi.nih.gov)... 130.14.250.12, 2607:f +220:41e:250::12 Connecting to ftp.ncbi.nih.gov (ftp.ncbi.nih.gov)|130.14.250.12|:21... + failed: Connection timed out. Connecting to ftp.ncbi.nih.gov (ftp.ncbi.nih.gov)|2607:f220:41e:250::1 +2|:21... failed: Network is unreachable. $

I took some care that newlines, spaces and plusses were removed from the foldover effect, but that is error-prone. I do see 'nih' in the url and suspect that it might be closed because of Orange Ulbricht.(?)

I can't call the package properly. I usually don't have packages with scope, so I don't know what to do with this. I tried to imitate what cpan does for module and then use as a lib. Perl isn't impressed:

$ mkdir Markov $ cd Markov/ $ touch Ndimensional.pm $ gedit Ndimensional.pm & [1] 20524 $ pwd /home/bob/2.scripts/pages/7.cw/template_stuff/crosswords/eugene/Markov [1]+ Done gedit Ndimensional.pm ^^^^ pasted in source $ cd .. $ ls 1.bliako.pl bliako1.pm Markov $ ./1.bliako.pl Can't locate Markov/Ndimensional.pm in @INC (you may need to install t +he Markov::Ndimensional module) (@INC contains: Markov /home/bob/perl +5/lib/perl5/x86_64-linux-gnu-thread-multi /home/bob/perl5/lib/perl5 / +etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.26.1 /usr/local/share +/perl/5.26.1 /usr/lib/x86_64-linux-gnu/perl5/5.26 /usr/share/perl5 /u +sr/lib/x86_64-linux-gnu/perl/5.26 /usr/share/perl/5.26 /usr/local/lib +/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at ./1.bliako.pl line + 13. BEGIN failed--compilation aborted at ./1.bliako.pl line 13. $

, where caller looks like this:

$ cat 1.bliako.pl #!/usr/bin/env perl # FILE: analyse_DNA_sequence.pl # by bliako use strict; use warnings; use Getopt::Long; use Data::Dump qw/dump/; use lib 'Markov'; use Markov::Ndimensional;

My question then is whether the download is there for anyone else and where to put Ndimensional.pm and how to call it correctly.

I think it's a way in which perl truly is relevant.

Replies are listed 'Best First'.
Re^2: n-dimensional statistical analysis of DNA sequences (or text, or ...)
by pryrt (Abbot) on Jan 23, 2019 at 17:55 UTC
    use lib 'Markov'; use Markov::Ndimensional;

    ... is looking for Markov/Markov/Ndimensional.pm. Try:

    use lib '.'; use Markov::Ndimensional;
      $ ./1.bliako.pl ./1.bliako.pl : ngram-length must be a positive integer. $ cat 1.bliako.pl #!/usr/bin/env perl # FILE: analyse_DNA_sequence.pl # by bliako use strict; use warnings; use Getopt::Long; use Data::Dump qw/dump/; use lib '.'; use Markov::Ndimensional;

      Great, I think that's gonna work with some proper input. Does this look right between p tags:

      wget ftp://ftp.ncbi.nih.gov/genomes/Homo_sapiens/CHR_20/hs_ref_GRCh38.p12_chr20.fa.gz

        The link is right as you just posted it

        When you download the sequence file you need to uncompress it: gunzip hs_ref_GRCh38.p12_chr20.fa.gz

        Then, the command in the brief usage section of my original post analyse_DNA_sequence.pl --input-fasta hs_ref_GRCh38.p12_chr20.fa --ngram-length 3 --output-state hs_ref_GRCh38.p12_chr20.fa.3.state --output-stats stats.txt means the following:

        • You are studying triplets of bases, e.g. ATC. This is the ngram-length = 3
        • Once finished, the object holding the stats will be serialised to disk as hs_ref_GRCh38.p12_chr20.fa.3.state
        • Brief info will be dumped to stats.txt

        If you want to remember your findings from the first sequence file when you process a second and append new data to it, then you use the previously output ".state" file as input state (--input-state) for your new run. This is to allow processing gigabytes of sequences incrementaly in small batches.

        bliako

Re^2: n-dimensional statistical analysis of DNA sequences (or text, or ...)
by bliako (Abbot) on Jan 23, 2019 at 19:28 UTC

    Regarding downloading the DNA sequence file try pasting the url on your browser or give curl a try. For the record, wget works in my system, maybe because of some wgetrc setting like passive ftp ON or user-agent string.

    pryrt answered your 2nd query! Basically set your lib (by use of use lib) wherever the module file is placed. If you want to replicate my system: mkdir Markov; mv Ndimensional.pm Markov mkdir -p lib/Markov; mv Ndimensional.pm lib/Markov should do it.

    Btw, this is where DNA sequences from a lot of species including homosapiens (HS) are located: ftp.ncbi.nih.gov/genome. If you can spare the CPU you can do statistical analyses not only between different chromosomes of HS but also between HS and other species. What's needed now is to take that frequency counts and compare them or just plot.

    Btw2, in addition to ATCG bases (canonical) one gets a few more, notably "N". Check Re^10: Reduce RAM required for more. I don't know more on these.

    Hope you get it running soon

      Hope you get it running soon.

      I've been running this software to try to understand and must remark at how impressed I am, now with results to show. I'll try to show the output without breaking the reader's scroll finger.

      The first program analyzes the dna at that zipped download. Then you gotta unzip it:

       gunzip -k hs_ref_GRCh38.p12_chr20.fa.gz

      Even between readmore tags, I'll edit it to highlight points.

      What gives with the clumpiness of the N's? They look like the duct tape that holds everything else together.

      It was also helpful for me to run it again and compare output:

      $ diff 1.bliako.txt 2.bliako.txt >3.bliako.txt

      Can you explain why these might be different one run to the next:

      285c285 < NT => ["A", 0.15, "C", 0.5, "G", 0.8, "T", 1], --- > NT => ["C", 0.35, "G", 0.65, "T", 0.85, "A", 1],

      Downloading shelley was easy.

      $ ./1.predict.pl --input-state 84.state ./1.predict.pl : read state from '84.state', ngram-length is 2. ./1.predict.pl : starting with seed 'futile' ... futile were placed at once turned with anxious suspense I threw me whe +n she might be guilty are soul have wandered with Project Gutenberg t +m works based ./1.predict.pl : done. $
      ./1.predict.pl : starting with seed 'reasoning' ... reasoning I paid a drunken ./1.predict.pl : done. $

      The .state file was truly amazing:

      friendship => ["and", 0.5, "You", 1], frightened => ["as", 0.5, "me", 1], frightful => [ "darkness", 0.111111111111111, "that", 0.222222222222222, "catalogue", 0.333333333333333, "selfishness", 0.444444444444444, "an", 0.555555555555556, "I", 0.666666666666667, "fiend", 0.777777777777778, "the", 0.888888888888889, "dreams", 1, ],

      But wait, is this software that makes a frankensentence? (I heard that it is one of the gifts that the magi brought to the child in nativity story.)

      This is incredible:

      $ ./1.predict.pl --input-state 84.state ./1.predict.pl : read state from '84.state', ngram-length is 2. ./1.predict.pl : starting with seed 'futile' ... futile were placed at once turned with anxious suspense I threw me whe +n she might be guilty are soul have wandered with Project Gutenberg t +m works based ./1.predict.pl : done. $ ./1.predict.pl --input-state 84.state ./1.predict.pl : read state from '84.state', ngram-length is 2. ./1.predict.pl : starting with seed 'reasoning' ... reasoning I paid a drunken ./1.predict.pl : done.

      I hope I didn't beat it up too much trying to replicate it....

        Can you explain why these might be different one run to the next: 285c285 < NT => ["A", 0.15, "C", 0.5, "G", 0.8, "T", 1], --- > NT => ["C", 0.35, "G", 0.65, "T", 0.85, "A", 1],

        Yes, this is cumulative probability. In order to get the actual probability subtract the one you are interested from its previous if it has any. For example, P(A|NT) = 0.15 and also P(A|NT) = 1 - 0.85=0.15 . So, P(A|NT)=0.15, P(C|NT)=0.35, P(G|NT)=0.3, P(T|NT)=0.2

        Because of the keys %hash always returning a random order of the keys, I could not avoid getting different representation of the same output without using an expensive sort. Likewise for not wasting memory by also calculating a "proper" probabilities in the way I have just shown you.

        Btw, that last  "N" => 3 is just your ngram-length.

        Well, it seems you have created a frankestein. it will end in tears ... brrrrrrrr.