in reply to Re^4: n-dimensional statistical analysis of DNA sequences (or text, or ...)
in thread n-dimensional statistical analysis of DNA sequences (or text, or ...)

I've tried to replicate and extend the first results I got, but I don't seem to be calling this properly. The logic for Getopt::Long seems pretty opaque to me now. Do you see why this is fundamentally different than what worked before?

$ ./2.analyse_text.pl --input-corpus 84-0.txt --ngram-length 4 --output-state 4.shelley.state
args are --input-corpus 84-0.txt --ngram-length 4 --output-state 4.shelley.state
Unknown option: input-corpus
Unknown option: ngram-length
Unknown option: output-state
Usage : ./2.analyse_text.pl <options>
Something wrong with command-line parameters...
$ cat 2.analyse_text.pl
#!/usr/bin/env perl

# FILE: analyse_text.pl
# by bliako

use 5.011;
use warnings;
use Getopt::Long;
use Data::Dump qw/dump/;
use lib '.';
use Markov::Ndimensional;

my @args = @ARGV;
say "args are @args";

my $input_corpus_filename = undef;
my $input_state_filename = undef;
my $output_state_filename = undef;
my $output_stats_filename = undef;
my $separator = '\s';
my $internal_separator = '|';
my $seed = undef;
my $num_iterations = 100;

if( ! Getopt::Long::GetOptions(
    'input-state=s' => \$input_state_filename,
    'separator=s' => \$separator,
    'num-iterations=i' => $num_iterations,
    'seed=s' => \$seed,
    'help|h' => sub { print STDERR usage($0); exit(0) }
) ){
    print STDERR usage($0) . "\n\nSomething wrong with command-line parameters...\n";
    exit(1);
}
...
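The "Unknown option" errors mean GetOptions was never told about those options: the spec in the pasted code declares input-state, separator, num-iterations and seed, but not input-corpus, ngram-length or output-state. (Note also that 'num-iterations=i' => $num_iterations is missing a backslash; it should be \$num_iterations to store the value into the variable.) A minimal sketch of a spec that would accept that command line follows; the option names come from the failing command line, while the subroutine name and the default are invented for illustration:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Parse an argument list into a hash; the option names match the
# command line above, everything else is illustrative.
sub parse_args {
    my @argv = @_;
    my %opt = ( 'ngram-length' => 2 );   # hypothetical default
    GetOptionsFromArray( \@argv, \%opt,
        'input-corpus=s',
        'ngram-length=i',
        'output-state=s',
    ) or die "Something wrong with command-line parameters...\n";
    return %opt;
}

my %opt = parse_args(@ARGV);
```

Using GetOptionsFromArray (rather than the @ARGV-consuming GetOptions) keeps the parsing testable in isolation.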

Re^6: n-dimensional statistical analysis of DNA sequences (or text, or ...)
by bliako (Abbot) on Feb 07, 2019 at 11:37 UTC

    Yes. You call analyse_text.pl correctly, but the Getopt::Long parameters in the code you are pasting do not belong to that script. They belong to predict_text.pl.

    let me know if you need more help

      let me know if you need more help

      Thx, bliako. I want to understand ngrams better now that I'm calling the scripts correctly again. Your example for Getopt::Long in my recent thread Re: chunking up texts correctly for online translation got me through the bog and back onto pavement. These scripts produce abundant output, which I've edited down; I will always post the source that created it. I do have questions too, but the whole enchilada definitely calls for readmore tags:

      Just another perl meditation....

        Aldebaran thank you for testing this module so thoroughly and persistently.

        I have made some modifications so that it uses johngg's method, from Re: improving speed in ngrams algorithm, of getting a sliding window of $ngram_length words at a time. My original code was not sliding at all! Thanks for pointing it out.
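        The sliding window described above can be sketched like this (illustrative code, not the module's actual implementation; the separator and function name are assumptions):

```perl
use strict;
use warnings;

# Count every n-gram by sliding a window of $n words along the text,
# one word at a time, joining words with '|' as in the output below.
sub ngram_counts {
    my ($text, $n) = @_;
    my @words = split /\s+/, $text;
    my %count;
    for my $i ( 0 .. $#words - $n + 1 ) {
        $count{ join '|', @words[ $i .. $i + $n - 1 ] }++;
    }
    return \%count;
}

my $c = ngram_counts( "the elixir of life is a chimera", 3 );
# windows: "the|elixir|of", "elixir|of|life", "of|life|is",
#          "life|is|a", "is|a|chimera"
```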

        I have also added another feature, --process-paragraphs, which first converts the text to paragraphs using this algorithm: replace every single newline with a space, so that runs of repeated newlines become the boundaries separating paragraphs. The sliding ngram window is then applied to paragraphs rather than to lines (the default). In "lines" mode the sliding window does not look across line boundaries: it stops at the end of each line and a new window starts at the next line. I have noticed that your test input (shelley) has a newline terminating each line of text, so with --process-paragraphs you will get different output, and hopefully the markov learner will be more accurate.
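        As a sketch, the paragraph-conversion step described above might look like this (the function name is illustrative, not the module's API):

```perl
use strict;
use warnings;

# Split on runs of two or more newlines (paragraph boundaries), then
# join the wrapped lines inside each paragraph with single spaces.
sub to_paragraphs {
    my ($text) = @_;
    my @paras = split /\n{2,}/, $text;
    for (@paras) { s/\n/ /g; s/^\s+|\s+$//g; }
    return grep { length } @paras;
}

my @p = to_paragraphs("It has been done\nexclaimed the soul\n\nmore, far more,\nwill I achieve\n");
# @p == ("It has been done exclaimed the soul",
#        "more, far more, will I achieve")
```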

        A very big issue is the word separator, which you can specify via --separator '\s+' (note the single quotes; you need to protect that backslash from the shell). Or make it something like --separator '[-'"'"'"\:,;\s\t.]+', which separates on whitespace or those punctuation marks. I find that this is good enough for my needs; anything more complex should go into another module, or use another module to convert a corpus to pure text.
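        To illustrate the difference, here are the two separators applied with split, written as plain Perl regexes (no shell quoting needed; the character class follows the post's shell-quoted version with the quoting removed):

```perl
use strict;
use warnings;

my $simple = qr/\s+/;                # whitespace only
my $punct  = qr/[-'"\:,;\s\t.]+/;    # whitespace or punctuation

my @a = split $simple, "dust, and-ashes; chimera.";
# ("dust,", "and-ashes;", "chimera.")  -- punctuation sticks to the words
my @b = split $punct,  "dust, and-ashes; chimera.";
# ("dust", "and", "ashes", "chimera")
```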

        Anyway, this works as expected:

        bin/analyse_text.pl --input-corpus data/2.short.shelley.txt --ngram-length 8
        ...
        "elixir|of|life|is|a|chimera|but|these" => 1,
        "the|elixir|of|life|is|a|chimera|but" => 1,
        ...
        # and
        bin/analyse_text.pl --input-corpus data/2.short.shelley.txt --ngram-length 8 --process-paragraphs
        ...
        "been|done|exclaimed|the|soul|of|Frankenstein|more" => 1,
        "done|exclaimed|the|soul|of|Frankenstein|more|far" => 1,
        "exclaimed|the|soul|of|Frankenstein|more|far|more" => 1,
        "Frankenstein|more|far|more|will|I|achieve|treading" => 1,
        "has|been|done|exclaimed|the|soul|of|Frankenstein" => 1,
        "of|Frankenstein|more|far|more|will|I|achieve" => 1,
        "soul|of|Frankenstein|more|far|more|will|I" => 1,
        "the|soul|of|Frankenstein|more|far|more|will" => 1,
        ...

        You will also see that I have included two new goodies, bin/analyze_image.pl and bin/predict_image.pl, which do not work (yet?) as expected. But they do open a whole new can of exciting possibilities ...

        The distribution can be installed from https://github.com/hadjiprocopis/Algorithm-Markov-Multiorder-Learner (for the moment).

        Thanks for your feedback (which has been acknowledged in the pod), and let me know if you need or discover anything else. bw, bliako