Re^7: n-dimensional statistical analysis of DNA sequences (or text, or ...)

let me know if you need more help

Thx, bliako. I want to understand ngrams better now that I'm calling the scripts correctly again. Your Getopt::Long example in my recent thread Re: chunking up texts correctly for online translation got me through the bog and back on pavement. These scripts produce abundant output, which I've edited down. I will always post the source that created it. I do have questions too, but the whole enchilada definitely calls for readmore tags:

I'm using a short selection from Shelley's Frankenstein as input. Starting out of the gate, ngram=2 looks about right:

args are --input-corpus 2.short.shelley.txt --ngram-length 2 --output-state 2.short.state
{
  "counts" => {
    "achieve|treading"         => 1,
    "acquired|new"             => 1,
     ...big snip of similar...
    "words|of"                 => 1,
    "works|in"                 => 1,
  },
  "cum-twisted-dist" => {
    achieve      => ["treading", 1],
    acquired     => ["new", 1],
    already      => ["marked", 1],
    and          => [
                      "the",
                      0.142857142857143,
                      "performed",
                      0.285714285714286,
                      "that",
                      0.428571428571429,
                      "even",
                      0.571428571428571,
                      "show",
                      0.714285714285714,
                      "their",
                      0.857142857142857,
                      "almost",
                      1,
                    ],
    As           => ["he", 1],
    as           => ["if", 1],
    ascend       => ["into", 1],
    be           => ["transmuted", 1],
    blood        => ["circulates", 1],
    ...big snip...
    with         => ["a", 0.5, "its", 1],
    words        => ["of", 1],
    works        => ["in", 1],
  },
  "dist" => {
    "achieve|treading"         => 0.0087719298245614,
    "acquired|new"             => 0.0087719298245614,
     ...big snip...
    "words|of"                 => 0.0087719298245614,
    "works|in"                 => 0.0087719298245614,
  },
  "N" => 2,
}
./2.analyse_text.pl : done.

We see that As and as are reckoned differently, for example. You can argue whether that's a bug or a feature, but there isn't anything outlandish about the output in this case.
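For what it's worth, here is a minimal sketch of how a cumulative next-word table like the "cum-twisted-dist" entry for "and" could be built from bigram counts. It is not bliako's implementation: the sub name `cumulative_followers` and the `fold_case` switch are made up for illustration. With `fold_case` on, As and as land in the same bucket.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Markov::Ndimensional.
# Takes an array ref of words and returns
#   { word => [ follower1, cum_prob1, follower2, cum_prob2, ... ] }
sub cumulative_followers {
  my ( $words, %opt ) = @_;
  my @w = $opt{fold_case} ? map { lc } @$words : @$words;

  # Count each (word, next word) pair, remembering first-seen order.
  my ( %count, %order );
  for my $i ( 0 .. $#w - 1 ) {
    my ( $this, $next ) = @w[ $i, $i + 1 ];
    push @{ $order{$this} }, $next unless $count{$this}{$next}++;
  }

  # Turn the counts into running (cumulative) probabilities.
  my %cum;
  for my $this ( keys %count ) {
    my $total = 0;
    $total += $_ for values %{ $count{$this} };
    my $running = 0;
    for my $next ( @{ $order{$this} } ) {
      $running += $count{$this}{$next} / $total;
      push @{ $cum{$this} }, $next, $running;
    }
  }
  return \%cum;
}

my @words  = qw(As he went on I felt as if my soul);
my $plain  = cumulative_followers( \@words );
my $folded = cumulative_followers( \@words, fold_case => 1 );

print "plain 'As':  @{ $plain->{As} }\n";
print "plain 'as':  @{ $plain->{as} }\n";
print "folded 'as': @{ $folded->{as} }\n";
```

Whether folding case is right depends on the corpus; for DNA or case-sensitive tokens you would leave it off.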

But then I started cranking up the ngram length: 3, 4, 8, and so on. By the time I see the output for ngram=8, I have reservations:

args are --input-corpus 2.short.shelley.txt --ngram-length 8 --output-state 2.short.state
{
  "counts" => {
    "already|marked|I|will|pioneer|a|new|way"                           => 1,
    "crucible|have|indeed|performed|miracles|They|penetrate|into"       => 1,
    "Frankenstein|more|far|more|will|I|achieve|treading"                => 1,
    "heavens|they|have|discovered|how|the|blood|circulates"             => 1,
    "mock|the|invisible|world|with|its|own|shadows"                     => 1,
    "of|nature|and|show|how|she|works|in"                               => 1,
    "one|purpose|So|much|has|been|done|exclaimed"                       => 1,
    "only|made|to|dabble|in|dirt|and|their"                             => 1,
    "promised|impossibilities|and|performed|nothing|The|modern|masters" => 1,
    "promise|very|little|they|know|that|metals|cannot"                  => 1,
    "sounded|and|soon|my|mind|was|filled|with"                          => 1,
    "Such|were|the|professor|s|words|rather|let"                        => 1,
    "they|can|command|the|thunders|of|heaven|mimic"                     => 1,
    "the|air|we|breathe|They|have|acquired|new"                         => 1,
    "The|ancient|teachers|of|this|science|said|he"                      => 1,
    "the|elixir|of|life|is|a|chimera|but"                               => 1,
    "the|fate|enounced|to|destroy|me|As|he"                             => 1,
    "touched|which|formed|the|mechanism|of|my|being"                    => 1,
    "unfold|to|the|world|the|deepest|mysteries|of"                      => 1,
    "went|on|I|felt|as|if|my|soul"                                      => 1,
    "were|grappling|with|a|palpable|enemy|one|by"                       => 1,
  },
  "cum-twisted-dist" => {
    "already|marked|I|will|pioneer|a|new"                       => ["way", 1],
    "crucible|have|indeed|performed|miracles|They|penetrate"    => ["into", 1],
    "Frankenstein|more|far|more|will|I|achieve"                 => ["treading", 1],
    "heavens|they|have|discovered|how|the|blood"                => ["circulates", 1],
    "mock|the|invisible|world|with|its|own"                     => ["shadows", 1],
    "of|nature|and|show|how|she|works"                          => ["in", 1],
    "one|purpose|So|much|has|been|done"                         => ["exclaimed", 1],
    "only|made|to|dabble|in|dirt|and"                           => ["their", 1],
    "promised|impossibilities|and|performed|nothing|The|modern" => ["masters", 1],
    "promise|very|little|they|know|that|metals"                 => ["cannot", 1],
    "sounded|and|soon|my|mind|was|filled"                       => ["with", 1],
    "Such|were|the|professor|s|words|rather"                    => ["let", 1],
    "they|can|command|the|thunders|of|heaven"                   => ["mimic", 1],
    "the|air|we|breathe|They|have|acquired"                     => ["new", 1],
    "The|ancient|teachers|of|this|science|said"                 => ["he", 1],
    "the|elixir|of|life|is|a|chimera"                           => ["but", 1],
    "the|fate|enounced|to|destroy|me|As"                        => ["he", 1],
    "touched|which|formed|the|mechanism|of|my"                  => ["being", 1],
    "unfold|to|the|world|the|deepest|mysteries"                 => ["of", 1],
    "went|on|I|felt|as|if|my"                                   => ["soul", 1],
    "were|grappling|with|a|palpable|enemy|one"                  => ["by", 1],
  },
  "dist" => {
    "already|marked|I|will|pioneer|a|new|way"                           => 0.0476190476190476,
    "crucible|have|indeed|performed|miracles|They|penetrate|into"       => 0.0476190476190476,
    "Frankenstein|more|far|more|will|I|achieve|treading"                => 0.0476190476190476,
    "heavens|they|have|discovered|how|the|blood|circulates"             => 0.0476190476190476,
    "mock|the|invisible|world|with|its|own|shadows"                     => 0.0476190476190476,
    "of|nature|and|show|how|she|works|in"                               => 0.0476190476190476,
    "one|purpose|So|much|has|been|done|exclaimed"                       => 0.0476190476190476,
    "only|made|to|dabble|in|dirt|and|their"                             => 0.0476190476190476,
    "promised|impossibilities|and|performed|nothing|The|modern|masters" => 0.0476190476190476,
    "promise|very|little|they|know|that|metals|cannot"                  => 0.0476190476190476,
    "sounded|and|soon|my|mind|was|filled|with"                          => 0.0476190476190476,
    "Such|were|the|professor|s|words|rather|let"                        => 0.0476190476190476,
    "they|can|command|the|thunders|of|heaven|mimic"                     => 0.0476190476190476,
    "the|air|we|breathe|They|have|acquired|new"                         => 0.0476190476190476,
    "The|ancient|teachers|of|this|science|said|he"                      => 0.0476190476190476,
    "the|elixir|of|life|is|a|chimera|but"                               => 0.0476190476190476,
    "the|fate|enounced|to|destroy|me|As|he"                             => 0.0476190476190476,
    "touched|which|formed|the|mechanism|of|my|being"                    => 0.0476190476190476,
    "unfold|to|the|world|the|deepest|mysteries|of"                      => 0.0476190476190476,
    "went|on|I|felt|as|if|my|soul"                                      => 0.0476190476190476,
    "were|grappling|with|a|palpable|enemy|one|by"                       => 0.0476190476190476,
  },
  "N" => 8,
}
./2.analyse_text.pl : done.

Source is:

#!/usr/bin/env perl

# FILE: analyse_text.pl
# by bliako

use 5.011;
use warnings;
use Getopt::Long;
use Data::Dump qw/dump/;
use lib '.';
use Markov::Ndimensional;

my @args = @ARGV;
say "args are @args";

my $input_corpus_filename = undef;
my $input_state_filename  = undef;
my $output_state_filename = undef;
my $output_stats_filename = undef;
my $separator             = '\s';
my $internal_separator    = '|';
my $ngram_length          = -1;
if (
  !Getopt::Long::GetOptions(
    'input-corpus=s' => \$input_corpus_filename,
    'input-state=s'  => \$input_state_filename,
    'output-state=s' => \$output_state_filename,
    'output-stats=s' => \$output_stats_filename,
    'ngram-length=i' => \$ngram_length,
    'separator=s'    => \$separator,
    'help|h'         => sub { print STDERR usage($0); exit(0) }
  )
  )
{
  print STDERR usage($0)
    . "\n\nSomething wrong with command-line parameters...\n";
  exit(1);
}

if ( $ngram_length <= 0 ) {
  print STDERR "$0 : ngram-length must be a positive integer.\n";
  exit(1);
}

my %params = ();
if ( defined($output_state_filename) ) { $params{'need'} = { 'all' => 1 } }
else                                   { $params{'avoid'} = { 'counts' => 1 } }
my $state = undef;
if ( defined($input_state_filename) ) {
  $state = load_state($input_state_filename);
  if ( !defined($state) ) {
    print STDERR "$0 : call to " . 'load_state()' . " has failed.\n";
    exit(1);
  }
  $params{'counts'} = $state->{'counts'};
}
if ( defined($input_corpus_filename) ) {
  $state = learn(
    {
      %params,
      'ngram-length'            => $ngram_length,
      'separator'               => $separator,
      'internal-separator'      => $internal_separator,
      'remove-these-characters' => '[^a-zA-Z]',
      'input-filename'          => $input_corpus_filename,
    }
  );
  if ( !defined($state) ) {
    print STDERR "$0 : call to " . 'learn()' . " has failed.\n";
    exit(1);
  }
}
if ( !defined($state) ) {
  # the corpus option in this script is --input-corpus, not --input-fasta
  print STDERR "$0 : --input-state and/or --input-corpus must be specified.\n";
  exit(1);
}

if ( defined($output_state_filename) ) {
  if ( !save_state( $state, $output_state_filename ) ) {
    print STDERR "$0 : call to " . 'save_state()' . " has failed.\n";
    exit(1);
  }
}
if ( defined($output_stats_filename) ) {
  print Data::Dump::dump($state);
}
else {
  print Data::Dump::dump($state);
}
print "\n$0 : done.\n";
exit(0);

sub usage {
  return "Usage : $0 <options>\n";
}
1;

Here, I see the entire output, and this is the data set:

"The ancient teachers of this science," said he,
"promised impossibilities and performed nothing. The modern masters
promise very little; they know that metals cannot be transmuted and that
the elixir of life is a chimera but these philosophers, whose hands seem
only made to dabble in dirt, and their eyes to pore over the microscope or
crucible, have indeed performed miracles. They penetrate into the recesses
of nature and show how she works in her hiding-places. They ascend into the
heavens; they have discovered how the blood circulates, and the nature of
the air we breathe. They have acquired new and almost unlimited powers;
they can command the thunders of heaven, mimic the earthquake, and even
mock the invisible world with its own shadows."

Such were the professor's words--rather let me say such the words of
the fate--enounced to destroy me.  As he went on I felt as if my soul
were grappling with a palpable enemy; one by one the various keys were
touched which formed the mechanism of my being; chord after chord was
sounded, and soon my mind was filled with one thought, one conception,
one purpose.  So much has been done, exclaimed the soul of
Frankenstein--more, far more, will I achieve; treading in the steps
already marked, I will pioneer a new way, explore unknown powers, and
unfold to the world the deepest mysteries of creation.

I do not see this text broken up into 8-grams as I would expect it. How would I expect that to look? Here is the first paragraph in a modified script from Re: improving speed in ngrams algorithm that tybalt89 threw out there recently for benchmarking:

8-word ngrams of '"The ancient teachers of this science," said he,
"promised impossibilities and performed nothing. The modern masters
promise very little; they know that metals cannot be transmuted and that
the elixir of life is a chimera but these philosophers, whose hands seem
only made to dabble in dirt, and their eyes to pore over the microscope or
crucible, have indeed performed miracles. They penetrate into the recesses
of nature and show how she works in her hiding-places. They ascend into the
heavens; they have discovered how the blood circulates, and the nature of
the air we breathe. They have acquired new and almost unlimited powers;
they can command the thunders of heaven, mimic the earthquake, and even
mock the invisible world with its own shadows."'
START INDEX: 0 :  "The ancient teachers of this science," said he,
START INDEX: 1 :  ancient teachers of this science," said he, "promised
START INDEX: 2 :  teachers of this science," said he, "promised impossibilities
START INDEX: 3 :  of this science," said he, "promised impossibilities and
START INDEX: 4 :  this science," said he, "promised impossibilities and performed
START INDEX: 5 :  science," said he, "promised impossibilities and performed nothing.
START INDEX: 6 :  said he, "promised impossibilities and performed nothing. The
START INDEX: 7 :  he, "promised impossibilities and performed nothing. The modern
...
START INDEX: 112 :  the earthquake, and even mock the invisible world
START INDEX: 113 :  earthquake, and even mock the invisible world with
START INDEX: 114 :  and even mock the invisible world with its
START INDEX: 115 :  even mock the invisible world with its own
START INDEX: 116 :  mock the invisible world with its own shadows."
--------------------

Source listing:

#!/usr/bin/env perl

use 5.026;
use warnings;

my $text = q{"The ancient teachers of this science," said he,
"promised impossibilities and performed nothing. The modern masters
promise very little; they know that metals cannot be transmuted and that
the elixir of life is a chimera but these philosophers, whose hands seem
only made to dabble in dirt, and their eyes to pore over the microscope or
crucible, have indeed performed miracles. They penetrate into the recesses
of nature and show how she works in her hiding-places. They ascend into the
heavens; they have discovered how the blood circulates, and the nature of
the air we breathe. They have acquired new and almost unlimited powers;
they can command the thunders of heaven, mimic the earthquake, and even
mock the invisible world with its own shadows."};

for ( 1 .. 8 ) {
  say qq{$_-word ngrams of '$text'};
  say for nGramWords( $_, $text );
  say q{-} x 20;
}

sub nGramWords {
  my ( $nWords, $string ) = @_;

  my @words = split m{\s+}, $string;
  my $start = 0;
  my @nGrams;

  while ( scalar @words >= $nWords ) {
    push @nGrams, join q{ },
      qq{START INDEX: @{ [ $start ++ ] } : },
      @words[ 0 .. $nWords - 1 ];
    shift @words;
  }

  return @nGrams;
}

1;

This version does not respect sentence boundaries, and it would be more complete if it did. Even if we threw out every window that spans a period, we would still see many more 8-tuples than appear in bliako's output, which does not respect sentence boundaries either. If some n-tuples are left out, every probability computed from the counts is off by the factor of the undercount.

Does anyone see a data undercount?
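To put a number on the suspected undercount, here is a small sketch that compares two ways of cutting a line into 8-grams. It is only my reading of the output above, which looks consistent with non-overlapping blocks of 8 words per line (with any short tail dropped); I have not confirmed that against the module's source. The sub names `words_of`, `chunked_ngrams` and `sliding_ngrams` are mine, not from Markov::Ndimensional.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Assumption: anything that is not a letter acts as a separator,
# roughly what 'remove-these-characters' => '[^a-zA-Z]' seems to do.
sub words_of {
  my ($line) = @_;
  ( my $clean = $line ) =~ s/[^a-zA-Z]+/ /g;
  return split ' ', $clean;
}

# Non-overlapping blocks of $n words; a tail shorter than $n is dropped.
sub chunked_ngrams {
  my ( $n, @w ) = @_;
  my @out;
  push @out, join( '|', splice( @w, 0, $n ) ) while @w >= $n;
  return @out;
}

# Sliding window: one n-gram per start position.
sub sliding_ngrams {
  my ( $n, @w ) = @_;
  return map { join( '|', @w[ $_ .. $_ + $n - 1 ] ) } 0 .. @w - $n;
}

my @lines = (
  'the fate--enounced to destroy me.  As he went on I felt as if my soul',
  'promise very little; they know that metals cannot be transmuted and that',
);

for my $line (@lines) {
  my @w = words_of($line);
  printf "%2d words: %d chunked, %d sliding\n",
    scalar @w,
    scalar( chunked_ngrams( 8, @w ) ),
    scalar( sliding_ngrams( 8, @w ) );
}
```

A line of w words yields floor(w/8) blocks but w-7 sliding windows (when w is at least 8), so the gap grows with line length, and windows that straddle a line break are missing in both cases.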

How do we make these ngrams respect the end of sentence boundary?
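One way to approach the sentence question, as a rough sketch: split the text into sentences first, then slide the window inside each sentence so that no n-gram crosses a boundary. The sub name `sentence_ngrams` is hypothetical, and the sentence splitter here is deliberately naive (it breaks on any run of `.`, `!` or `?`, so abbreviations like "Dr." would be split wrongly).

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: sliding n-grams that never cross a sentence end.
sub sentence_ngrams {
  my ( $n, $text ) = @_;
  my @ngrams;
  for my $sentence ( split /[.!?]+/, $text ) {
    # Words are runs of letters; drop the empty leading field, if any.
    my @w = grep { length } split /[^a-zA-Z]+/, $sentence;
    next if @w < $n;    # sentence too short for even one n-gram
    push @ngrams,
      map { join( ' ', @w[ $_ .. $_ + $n - 1 ] ) } 0 .. @w - $n;
  }
  return @ngrams;
}

my $text = 'They ascend into the heavens. They have acquired new powers.';
print "$_\n" for sentence_ngrams( 3, $text );
```

The trade-off is that sentences shorter than n words contribute nothing, which matters for n=8 on dialogue-heavy text.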

Just another perl meditation....

Re^8: n-dimensional statistical analysis of DNA sequences (or text, or ...) by bliako (Abbot) on Jun 28, 2019 at 13:40 UTC
Aldebaran, thank you for testing this module so thoroughly and persistently.

I have made some modifications so that it uses johngg's method of getting a sliding window of `$ngram_length` words at a time, from Re: improving speed in ngrams algorithm. My original code was not sliding at all! Thanks for pointing it out.

I have also added another feature, `--process-paragraphs`, which first converts the text to paragraphs using this algorithm: replace all single newlines with a space, so that repeated newlines become the boundary separating paragraphs. It then applies the sliding ngram window to paragraphs rather than to lines, which is the default. In "lines" mode, the sliding window does not take the previous or next line into account: it stops at the end of the line, then starts a new sliding window at the next line. I have noticed that your test input (shelley) has newlines separating each line of text. So with `--process-paragraphs` you will get different output, and hopefully the markov learner will be more accurate.

A very big issue is the word separator, which you can specify via `--separator '\s+'` (note the single quotes, you need to protect that backslash from the shell). Or make it like `--separator '[-'"'"'"\:,;\s\t.]+'`, which separates on space or those punctuation marks. I find that this is good enough for my needs, and anything more complex should go to another module, or use another module to convert a corpus to pure text.

Anyway, this works as expected:

bin/analyse_text.pl --input-corpus data/2.short.shelley.txt --ngram-length 8
...
"elixir|of|life|is|a|chimera|but|these" => 1,
"the|elixir|of|life|is|a|chimera|but" => 1,
...
# and
bin/analyse_text.pl --input-corpus data/2.short.shelley.txt --ngram-length 8 --process-paragraphs
...
"been|done|exclaimed|the|soul|of|Frankenstein|more" => 1,
"done|exclaimed|the|soul|of|Frankenstein|more|far" => 1,
"exclaimed|the|soul|of|Frankenstein|more|far|more" => 1,
"Frankenstein|more|far|more|will|I|achieve|treading" => 1,
"has|been|done|exclaimed|the|soul|of|Frankenstein" => 1,
"of|Frankenstein|more|far|more|will|I|achieve" => 1,
"soul|of|Frankenstein|more|far|more|will|I" => 1,
"the|soul|of|Frankenstein|more|far|more|will" => 1,
...

You will also see that I have included a new goodie: `bin/analyse_image.pl` and `bin/predict_image.pl`, which do not work (yet?) as expected. But they do open a whole new can of exciting possibilities ...

The distribution can be installed from https://github.com/hadjiprocopis/Algorithm-Markov-Multiorder-Learner (for the moment).

Thanks for your feedback (which has been acknowledged in pod) and let me know if you need or discover something else,

bw, bliako
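The paragraph conversion described in this reply can be sketched as follows. This is only an illustration of the stated rule (a lone newline becomes a space, a run of two or more newlines separates paragraphs), not the code in the distribution; the sub name `to_paragraphs` is made up here.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper illustrating the rule described above.
sub to_paragraphs {
  my ($text) = @_;
  $text =~ s/\r\n?/\n/g;             # normalise line endings
  $text =~ s/(?<!\n)\n(?!\n)/ /g;    # a lone newline becomes a space
  my @paragraphs;
  for my $p ( split /\n{2,}/, $text ) {
    $p =~ s/^\s+|\s+$//g;            # trim stray leading/trailing blanks
    push @paragraphs, $p if length $p;
  }
  return @paragraphs;
}

# Short excerpt of the test input, spanning the paragraph break.
my $text = <<'END';
they can command the thunders of heaven, mimic the earthquake, and even
mock the invisible world with its own shadows.

Such were the professor's words--rather let me say such the words of
the fate--enounced to destroy me.
END

my @paragraphs = to_paragraphs($text);
printf "%d paragraphs\n", scalar @paragraphs;
print "[$_]\n" for @paragraphs;
```

Once the hard line breaks are gone, a sliding window can run across what used to be a line end (for example "and even mock the"), which a per-line window cannot see.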
Re^9: n-dimensional statistical analysis of DNA sequences (or text, or ...) by Aldebaran (Curate) on Jul 05, 2019 at 23:30 UTC
Thanks for your feedback (which has been acknowledged in pod) and let me know if you need or discover something else,

To have a mention in the gitworld is quite gratifying. Maybe all of this work going through _Intermediate Perl_ is starting to pay off. I have thoroughly enjoyed replicating your work and wish that I could get it all. Some of it is over my head.

I have many of the same questions that I've had for a while now, and this might showcase how to answer a couple of them categorically. Let's start the usual way:

$ pwd
/home/bob/Documents/meditations/Algorithm-Markov-Multiorder-Learner-master
$ perl Makefile.PL
Checking if your kit is complete...
Warning: the following files are missing in your kit:
	.BACKUP/28.06.2019_13.28.tgz
	data/2.short.shelley.txt
	data/ShelleyFrankenstein.txt
	...
	states/small.fa.state
Please inform the author.
Generating a Unix-style Makefile
Writing Makefile for Algorithm::Markov::Multiorder::Learner
Writing MYMETA.yml and MYMETA.json

So there are ten files or so whose absence triggers warnings. If I'm to guess, you might struggle with how much of an example file to leave in the distribution. I think data/2.short.shelley.txt in its 2 paragraphs is perfect. The entire shelley text is probably too much. I must say, it is a great text to work with. Let me say this though: without an existing data/2.short.shelley.txt, your test fails.

My BIG QUESTION about these distributions has been "where do you put the data?" If I put it in a place so that the install test passes, as one would expect from such source:

$ pwd
/home/bob/Documents/meditations/Algorithm-Markov-Multiorder-Learner-master/t
$ cat 01-learn-text.t
#!/usr/bin/env perl
use strict;
use warnings;
use lib 'blib/lib';
use Test::More;
use Algorithm::Markov::Multiorder::Learner;
my $num_tests = 0;
my $input_corpus_filename = 'data/2.short.shelley.txt';
...

Then I can't figure out how to do the paths on the command line. Consider this the base:

$ pwd
/home/bob/Documents/meditations/Algorithm-Markov-Multiorder-Learner-master
$ ls
bin   Changes  lib       Makefile.PL  MYMETA.json  output      README  xt
blib  data     Makefile  MANIFEST     MYMETA.yml   pm_to_blib  t
$ cd data
$ ls
1.mp.txt             2.short.shelley.txt      'Untitled Document 1'
1.pope.txt           3.Shelley_short.txt
2.Shelley_short.txt  ShelleyFrankenstein.txt

Our files are in a child directory named data.

$ cd ..
$ pwd
/home/bob/Documents/meditations/Algorithm-Markov-Multiorder-Learner-master
$ cd blib/
$ cd script/
$ pwd
/home/bob/Documents/meditations/Algorithm-Markov-Multiorder-Learner-master/blib/script
$ ls
analyse_DNA_sequence.pl  analyse_text.pl   predict_text.pl
analyse_image.pl         predict_image.pl  read_state.pl
$

Meanwhile our scripts are in a grandchild directory. And this per the makefile:

cp bin/analyse_text.pl blib/script/analyse_text.pl
"/usr/bin/perl" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/analyse_text.pl

Am I correct that this script needs to be run in `Algorithm-Markov-Multiorder-Learner-master/blib/script`?

Forgive me if I missed this, but can you lay out an example that works for invoking analyse_text.pl?
Re^10: n-dimensional statistical analysis of DNA sequences (or text, or ...) by bliako (Abbot) on Jul 06, 2019 at 00:43 UTC
Now in github you will find the data and states dirs as well, so hopefully it should not complain about missing files.

You don't need to run the script INSIDE `Algorithm-Markov-Multiorder-Learner-master/blib/script`. Without installing it, it suffices to be inside `Algorithm-Markov-Multiorder-Learner-master` and run it like `blib/script/analyse_text.pl`. If you install it (`make install`) then it should be in a dir which is in your path. Then just call it `analyse_text.pl`. If you want to install it in a user dir, uncomment the `INSTALL_BASE` line in Makefile.PL and add ~/usr/bin to your path.
Re^11: n-dimensional statistical analysis of DNA sequences (or text, or ...) by Aldebaran (Curate) on Jul 07, 2019 at 01:17 UTC