Aldebaran, thank you for testing this module so thoroughly and persistently.

I have modified it to use johngg's method of taking a sliding window of $ngram_length words at a time (from "Re: improving speed in ngrams algorithm"). My original code was not sliding at all! Thanks for pointing it out.
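
To make the sliding-window idea concrete, here is a minimal sketch. It is not the module's code and not johngg's exact code: sliding_ngrams is a name made up for illustration, and the '|' join simply mirrors the keys shown in the output further down.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module's API): return every window
# of $ngram_length consecutive words, advancing one word at a time.
sub sliding_ngrams {
    my ($words, $ngram_length) = @_;
    my @ngrams;
    # Last valid start index is $#$words - $ngram_length + 1; with fewer
    # words than $ngram_length the range is empty and nothing is returned.
    for my $start (0 .. $#$words - $ngram_length + 1) {
        push @ngrams, join '|', @{$words}[$start .. $start + $ngram_length - 1];
    }
    return @ngrams;
}

my @words = split /\s+/, 'the elixir of life is a chimera';
my %counts;
$counts{$_}++ for sliding_ngrams(\@words, 3);
print "$_ => $counts{$_}\n" for sort keys %counts;
```

With 7 words and a window of 3 this yields 5 overlapping ngrams, which is the behaviour a non-sliding version misses.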

I have also added another feature, --process-paragraphs, which first converts the text to paragraphs: every single newline is replaced with a space, so that runs of repeated newlines become the boundaries separating paragraphs. The sliding ngram window is then applied to paragraphs rather than to lines, which is the default. In "lines" mode the sliding window does not take the previous or next line into account: it stops at the end of the line and a new window starts at the next line. I have noticed that your test input (shelley) has a newline at the end of each line of text, so with --process-paragraphs you will get different output, and hopefully the Markov learner will be more accurate.
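
The paragraph conversion can be sketched roughly as follows. This is an illustration under assumptions (Unix newlines, a made-up helper name text_to_paragraphs), not the code the module actually runs.

```perl
use strict;
use warnings;

# Hypothetical helper: turn raw text into a list of paragraphs.
sub text_to_paragraphs {
    my ($text) = @_;
    # A newline with no newline on either side is only a line wrap:
    # replace it with a space.
    $text =~ s/(?<!\n)\n(?!\n)/ /g;
    # The remaining runs of newlines are the paragraph boundaries.
    # Trim each piece and drop any that end up empty.
    return grep { length }
           map  { my $p = $_; $p =~ s/^\s+|\s+$//g; $p }
           split /\n+/, $text;
}

my $text = "first line\nsecond line\n\nnew paragraph\ncontinues here\n";
print "[$_]\n" for text_to_paragraphs($text);
```

Each returned paragraph would then get its own sliding window, so ngrams can span the original line breaks but never cross a paragraph boundary.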

A very big issue is the word separator, which you can specify via --separator '\s+' (note the single quotes: you need to protect the backslash from the shell). Or make it --separator '[-'"'"'"\:,;\s\t.]+' which separates on whitespace or those punctuation marks. I find that this is good enough for my needs; anything more complex should go into another module, or another module should be used to convert the corpus to pure text first.
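
For reference, once the shell has removed its quoting, the second separator reaches the script as the plain string [-'"\:,;\s\t.]+ . A small sketch of splitting with it follows; split_words and the sample line are made up for illustration, and how the module applies the separator internally is not shown here.

```perl
use strict;
use warnings;

# What the shell hands over for --separator '[-'"'"'"\:,;\s\t.]+'
my $separator = q{[-'"\:,;\s\t.]+};
my $sep_re    = qr/$separator/;

# Hypothetical helper: split text into words, dropping the empty field
# that split produces when the text starts with a separator.
sub split_words {
    my ($text, $re) = @_;
    return grep { length } split $re, $text;
}

# Made-up sample line with some punctuation in it.
my $line = q{"has been done," exclaimed the soul of Frankenstein; more, far more};
print join('|', split_words($line, $sep_re)), "\n";
```

Because the character class is followed by +, a run such as a comma, a quote and a space counts as a single separator.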

Anyway, this works as expected:

bin/analyse_text.pl --input-corpus data/2.short.shelley.txt --ngram-length 8
...
"elixir|of|life|is|a|chimera|but|these" => 1,
"the|elixir|of|life|is|a|chimera|but" => 1,
...
# and
bin/analyse_text.pl --input-corpus data/2.short.shelley.txt --ngram-length 8 --process-paragraphs
...
"been|done|exclaimed|the|soul|of|Frankenstein|more" => 1,
"done|exclaimed|the|soul|of|Frankenstein|more|far" => 1,
"exclaimed|the|soul|of|Frankenstein|more|far|more" => 1,
"Frankenstein|more|far|more|will|I|achieve|treading" => 1,
"has|been|done|exclaimed|the|soul|of|Frankenstein" => 1,
"of|Frankenstein|more|far|more|will|I|achieve" => 1,
"soul|of|Frankenstein|more|far|more|will|I" => 1,
"the|soul|of|Frankenstein|more|far|more|will" => 1,
...

You will also see that I have included a new goodie: bin/analyse_image.pl and bin/predict_image.pl, which do not work (yet?) as expected. But they do open a whole new can of exciting possibilities ...

The distribution can be installed from https://github.com/hadjiprocopis/Algorithm-Markov-Multiorder-Learner (for the moment).

Thanks for your feedback (which has been acknowledged in pod) and let me know if you need or discover something else, bw bliako

Re^9: n-dimensional statistical analysis of DNA sequences (or text, or ...)
by Aldebaran (Curate) on Jul 05, 2019 at 23:30 UTC
    Thanks for your feedback (which has been acknowledged in pod) and let me know if you need or discover something else,

    To have a mention in the gitworld is quite gratifying. Maybe all of this work going through _Intermediate Perl_ is starting to pay off.

    I have thoroughly enjoyed replicating your work and wish that I could follow all of it; some parts are over my head. I have many of the same questions that I've had for a while now, and this might showcase how to answer a couple of them categorically.

    Let's start the usual way:

    $ pwd
    /home/bob/Documents/meditations/Algorithm-Markov-Multiorder-Learner-master
    $ perl Makefile.PL
    Checking if your kit is complete...
    Warning: the following files are missing in your kit:
            .BACKUP/28.06.2019_13.28.tgz
            data/2.short.shelley.txt
            data/ShelleyFrankenstein.txt
            ...
            states/small.fa.state
    Please inform the author.
    Generating a Unix-style Makefile
    Writing Makefile for Algorithm::Markov::Multiorder::Learner
    Writing MYMETA.yml and MYMETA.json

    So there are ten files or so whose absence triggers warnings. If I were to guess, you might be struggling with how much example data to ship with the distribution. I think data/2.short.shelley.txt, at 2 paragraphs, is perfect. The entire Shelley text is probably too much, though I must say it is a great text to work with. Note, however, that without an existing data/2.short.shelley.txt, your test fails.

    My BIG QUESTION about these distributions has been "where do you put the data?" If I put it in a place so that the install test passes as one would expect from such source:

    $ pwd
    /home/bob/Documents/meditations/Algorithm-Markov-Multiorder-Learner-master/t
    $ cat 01-learn-text.t
    #!/usr/bin/env perl
    use strict;
    use warnings;
    use lib 'blib/lib';
    use Test::More;
    use Algorithm::Markov::Multiorder::Learner;
    my $num_tests = 0;
    my $input_corpus_filename = 'data/2.short.shelley.txt';
    ...

    Then I can't figure out how to do the paths on the command line. Consider this the base:

    $ pwd
    /home/bob/Documents/meditations/Algorithm-Markov-Multiorder-Learner-master
    $ ls
    bin   Changes  lib       Makefile.PL  MYMETA.json  output      README  xt
    blib  data     Makefile  MANIFEST     MYMETA.yml   pm_to_blib  t
    $ cd data
    $ ls
    1.mp.txt              2.short.shelley.txt      'Untitled Document 1'
    1.pope.txt            3.Shelley_short.txt
    2.Shelley_short.txt   ShelleyFrankenstein.txt

    Our files are in a child directory named data.

    $ cd ..
    $ pwd
    /home/bob/Documents/meditations/Algorithm-Markov-Multiorder-Learner-master
    $ cd blib/
    $ cd script/
    $ pwd
    /home/bob/Documents/meditations/Algorithm-Markov-Multiorder-Learner-master/blib/script
    $ ls
    analyse_DNA_sequence.pl  analyse_text.pl   predict_text.pl
    analyse_image.pl         predict_image.pl  read_state.pl
    $

    Meanwhile, our scripts are in a grandchild directory, copied there by the Makefile:

    cp bin/analyse_text.pl blib/script/analyse_text.pl
    "/usr/bin/perl" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/analyse_text.pl

    Am I correct that this script needs to be run in

    Algorithm-Markov-Multiorder-Learner-master/blib/script ?

    Forgive me if I missed this, but can you lay out an example that works for invoking analyse_text.pl?

      Now in github you will find the data and states dirs as well. So hopefully it should no longer complain about missing files.

      You don't need to run the script INSIDE Algorithm-Markov-Multiorder-Learner-master/blib/script. Without installing it, it suffices to be inside Algorithm-Markov-Multiorder-Learner-master and run it as blib/script/analyse_text.pl . If you install it (make install) then it should be in a dir which is in your path. Then just call it analyse_text.pl . If you want to install it in a user dir, uncomment the INSTALL_BASE line in Makefile.PL and add ~/usr/bin to your path.

        Now in github you will find the data and states dirs as well.

        It dramatically increases the size of the distro, but you're still less than a meg:

        $ ls A* -l
        -rw-rw-r-- 1 bob bob 938559 Jul  6 17:44 'Algorithm-Markov-Multiorder-Learner-master(1).zip'
        -rw-rw-r-- 1 bob bob  28855 Jun 28 14:11 Algorithm-Markov-Multiorder-Learner-master.zip
        $

        If you install it (make install) then it should be in a dir which is in your path. Then just call it analyse_text.pl

        This, I did not realize...

        $ pwd
        /home/bob/Documents/meditations/Algorithm-Markov-Multiorder-Learner-master/data
        $ analyse_text.pl --input-corpus 2.short.shelley.txt --ngram-length 8 --output-state 2.short.state >8.txt
        $

        and bingo:

        $ cat 8.txt
        {
          "counts" => {
            "air|we|breathe|They|have|acquired|new|and" => 1,
            "already|marked|I|will|pioneer|a|new|way" => 1,