Dividing a file into groups of two words and counting them

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Dividing a file into groups of two words and counting them by apl (Monsignor) on May 14, 2008 at 16:13 UTC
Rather than constructing @word for each line, construct it for the entire file. Then rather than `foreach $word(@word) { $count{$word}++; }` [download] you could say `for $w (0 .. $#word) { $count{$word[$w].' '.$word[$w+1] }++; }` [download] Beware of having an odd number of words in the file; you might pretest and stick a blank in the final (even) entry. The code hasn't been tested, but I hope you get the idea.	[reply] [d/l] [select]
Re: Dividing a file into groups of two words and counting them by toolic (Bishop) on May 14, 2008 at 16:16 UTC
You could modify your loop as follows: `foreach my $sequence (@bigram) { #my $count; is this a bug? my @word = split ' ', $sequence; my $prev = shift @word; foreach $word (@word) { $count{"${prev}_${word}"}++; $prev = $word; } }` [download] If "test.txt" contains: `a b n z d f n z v` [download] this prints: `n_z occurs 2 times a_b occurs 1 times d_f occurs 1 times f_n occurs 1 times b_n occurs 1 times` [download] Is this what you are looking for? Note that `my $count;` seems to be a bug. Update: Another solution would be to try to `use List::MoreUtils qw(natatime);`	[reply] [d/l] [select]
Re: Dividing a file into groups of two words and counting them by johngg (Canon) on May 14, 2008 at 20:06 UTC
I may have misunderstood but I think you want each pair overlapping, i.e "this word and that" has three pairs viz. "this word", "word and" and "and that". If that is the case you can do a a global regex match using a capture for one word followed by non-word characters then a look-ahead assertion for the next word, also with a capture. use strict; use warnings; open my $inFH, q{<}, \ <<'END_OF_FILE' or die qq{open: $!\n}; peter piper picked a peck of pickled peppers a peck of pickled peppers peter piper picked if peter piper picked a peck of pickled peppers where's the peck of pickled peppers peter piper picked END_OF_FILE my $textFile = do { local $/; <$inFH>; }; close $inFH or die qq{close: $!\n}; my $rxWordPair = qr {(?x) ([\w'-]+) \W+ (?=([\w'-]+)) }; my %pairFrequencies; while ( $textFile =~ m{$rxWordPair}g ) { $pairFrequencies{ qq{$1 $2} } ++; } print map { qq{$_: $pairFrequencies{ $_ }\n} } sort { $pairFrequencies{$b} <=> $pairFrequencies{$a} \|\| $a cmp $b } keys %pairFrequencies; [download] This produces `of pickled: 4 peck of: 4 peter piper: 4 pickled peppers: 4 piper picked: 4 a peck: 3 peppers peter: 2 picked a: 2 if peter: 1 peppers a: 1 peppers where's: 1 picked if: 1 the peck: 1 where's the: 1` [download] I hope this is useful. Cheers, JohnGG	[reply] [d/l] [select]
Re: Dividing a file into groups of two words and counting them by pc88mxer (Vicar) on May 14, 2008 at 16:00 UTC
Here's a hint... suppose you could read in one word at a time. Then you could use a strategy like this: `my $last_word; sub process_word { my $word = shift; ...process it... $last_word = $word; } for every word in the file { process_word($word); }` [download] Now we just need a way to read every word in the file and sent it through the `process_word` routine. As a first step, why don't you write the code assuming that there is only one word per input line and get that working. Then see what you have to do to modify it when there are multiple words on an input line.	[reply] [d/l] [select]
Re: Dividing a file into groups of two words and counting them by Anonymous Monk on May 14, 2008 at 16:17 UTC
`my %pairs; local $/; my $data = <DATA>; my @words = split /\W+/, lc($data_; $pairs{ $words[$_].' '.$words[$_+1] }++ for 0..$#words-1; print "$pairs{$_}\t$_\n" for sort { $pairs{$b}<=>$pairs{$a} } keys %pairs; __DATA__ To be or not to be, that is the question Ask not what your country can do for you but what you can do for your +country I think therefore I am I drink therefore I am I'm drunk therfore I was Alas twas not to be` [download]	[reply] [d/l]
Re: Dividing a file into groups of two words and counting them by locked_user sundialsvc4 (Abbot) on May 15, 2008 at 12:43 UTC
The main-loop should simply scan the file, split each line, and call a subroutine for each word. Let's assume a global scalar `$previous_word` whose value is of-course initially `undef`. The subroutine checks this scalar to see if it is `defined()`. If so, it contains the first word and we've just been given the second, so tally it, then set `$previous_word` back to `undef`. Otherwise, store the word in `$previous_word` and do nothing more. Remember that you must account for the last word if the file contains an odd number. You can detect this case, after the loop ends, by checking to see if `$previous_word` is `defined()`.