chinamox has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

I am working on a pretty nifty assignment involving the using the Markov Chain Algorithm to generate random text from an imported text file. This is done by counting a certain number (in this case two) of words that occur before a third word. It records the number of times $word3 occurs before $word1 and $word2. it then check to see what other words occur before $word1 and $word2, and records them.

In the second half of the program it will then use these frequencies to render the pseudo text.

Dr. Seuss seems to be best for showing this…

__DATA__ I will not eat them in a bar. I will not eat them in a car. I will not eat them Sam I am. I will not eat green eggs and ham! __END__

What I need help with is selecting a data structure that will let me save the matching word pairs (‘I will’ is the first pair) and the third word(‘not’) and record the number of times these parings occur and the percentage of times $word_three occurs before $word1_word2.

Thus the resulting data in the data might look something like this.

I will not (4, 100%) Will not eat (4, 100%) Not eat them(3, 75%) green (1, 25%) Eat them in (2, 66.6%) Sam (1, 33.3%) Them in a (2, 100%) In a bar (1, 50%) car (1, 50%) A bar I (1, 100%) Bar I will (1, 100%) A car I (1, 100%) Them Sam I (1, 100%) Sam I am. (1, 100%) I am. I (1, 100%) Am. I will (1, 100%) Green eggs and (1, 100%) Eggs and ham! (1, 100%) ( the read loop/subroutine terminates do to lack of new words.)

Since I think I will be accessing these parings via the $word1-$word2 parings to generate the output, I am inclined to use a hash of arrays.

Does this sound sane or am I running in the wrong direction?

This is my first attempt at using an advanced data structure in Perl, and while I am confident of my ability to pass the data into the program and parse it, I am still unsure about the best way to store the parsed data.

As always, thank you for your guidance.

-mox

Replies are listed 'Best First'.
Re: Trying to select the best data structure
by bobf (Monsignor) on Oct 29, 2006 at 03:37 UTC

    Your text and example differ slightly. In multiple places the text indicates word 3 comes before words 1 and 2, but your example shows word 3 coming after words 1 and 2. I am assuming you want the latter, but in reality it doesn't impact the choice of data structure.

    There are a lot of ways to approach this, but I might use something like

    $hash{$word1}{$word2}{$word3} = $count
    where $count is the number of times $word3 is found immediately after (or before) words 1 and 2.

    Note that I did not store the percentage, which can be calculated after all of the counts have been accumulated. If you wanted a more complicated structure to facilitate that, you could do something like

    $hash{$word1}{$word2}{$word3} = { count => $count, freq => $frequency }
    or perhaps even better
    $hash{$word1}{$word2} = { total_counts => $tot_counts, words => { $word3 => $count, } }
    In the last example you would increment the count for the appropriate $word3 as well as the total_counts for the word pair. This would allow you to easily calculate the frequency for any given $word3 on the fly.*

    To populate the data structure, simply identify words 1, 2, and 3 and then do

    $hash{$word1}{$word2}{total_counts}++; $hash{$word1}{$word2}{words}{$word3}++;
    (or similar, depending on which one you choose).

    These structures may be able to be simplified, depending on your requirements. On the other hand, if you had a lot of data and expected the structure to become quite large, a database may be a better choice.

    Note that I did not use any arrays. Since the description of the problem implied that you would be looking up data by $word, a hash seemed more appropriate. An array (or an array of hash refs) would require you to search through the contents of the array to find the $word of interest, and depending on the size of the arrays that could be a very inefficient approach.

    *If you really wanted to store the frequency of each word and you didn't need a running total, I'd suggest doing the calculation at the end and using $word3 => { count => $count, freq  => $frequency } rather than $word3 => $count, which would make adding the frequencies later a bit more straightforward.

      Thank you very much for your time and thoughts. I went back after reading your excellent explination and looked over the data structures cookbook. I now feel much more informed about Perl's data structures and understand why arrays would not work well in this situation.

      cheers,
      -mox
Re: Trying to select the best data structure
by GrandFather (Saint) on Oct 29, 2006 at 05:25 UTC

    I'd be inclined to use a hash containing pairs of counts and following word hashes. The following works for generating data sets of arbitary orders:

    use strict; use warnings; use Data::Dump::Streamer; use constant ORDER => 3; my %chains; while (<DATA>) { my @words = split; my @chain; for (@words) { push @chain, $_; next if @chain < ORDER; shift @chain if @chain > ORDER; my $root; for my $index (0 .. ORDER - 1) { if (defined $root) { ++$root->[1]{$chain[$index]}[0]; $root = $root->[1]{$chain[$index]}; } else { ++$chains{$chain[$index]}[0]; $root = $chains{$chain[$index]}; } } } } Dump (\%chains); __DATA__ I will not eat them in a bar. I will not eat them in a car. I will not eat them Sam I am. I will not eat green eggs and ham!

    DWIM is Perl's answer to Gödel

      Once again your time and efforts are clear and helpful. Thank you for both.

      -mox
Re: Trying to select the best data structure
by Khen1950fx (Canon) on Oct 29, 2006 at 03:17 UTC
    I think that you'll need to read the Perl Data Structures Cookbook. I use it all the time. See: perldsc

      Thank you for the link. It was very useful.

      -mox
Re: Trying to select the best data structure
by Not_a_Number (Prior) on Oct 29, 2006 at 08:56 UTC

    One question: does the order of the pair ($word1, $word2) matter? In other words (making the same assumption as to what you mean as bobf above), is:

    I will not

    equivalent to:

    will I not

    for your purposes, or is it different?

    Update: Since you say "I think I will be accessing these parings via the $word1-$word2 parings", why not use an HoH, with $word1 and $word2 stringified as the key to the outer hash, and populate it something like this:

    # foreach group of three words $hash{$word1 . ' ' . $word2}{$word3}++;

    Or, more generally, for any number of words in your 'chain':

    # foreach group of n words in array @n_words my $nth_word = pop @n_words; $hash{join ' ', @n_words}{$nth_word}++;

    And if the answer to my question in the first part of this post is that order doesn't matter, just change that last line to:

          $hash{join ' ', sort @n_words}{$nth_word}++;

      Thank you very much. I am will read up on HoH's and use your suggestion in my final solution! Looking more into whether word order is important, I have concluded that it is, and so I will be using the $word1 and $word2 stringfield as a key to the out ther hash as you suggested. Thank you again for your help!

      -mox