Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to count the number of pairs of words in a text file and have come unstuck on the best way of donig so. What I want to do is to capture word a and then the next word (a+1)(count it) and then do that for the next word (which was a+1) and build a count of each set of two word pairings for the entire file. I can split it into single words but not a group of two.
#!perl use strict; use warnings; my $file = "test.txt"; open (IN, $file) || die "$file not found\n"; my @bigram = <IN>; close (IN); chomp @bigram; my %count; my $word; foreach my $sequence (@bigram) { my $count; my @word = split ' ', $sequence; foreach $word(@word) { $count{$word}++; } } foreach $word (sort by_count keys %count) { print "$word occurs $count{$word} times\n"; } #orders the count sub by_count { $count{$b} <=> $count{$a}; }

Replies are listed 'Best First'.
Re: Dividing a file into groups of two words and counting them
by apl (Monsignor) on May 14, 2008 at 16:13 UTC
    Rather than constructing @word for each line, construct it for the entire file. Then rather than
    foreach $word(@word) { $count{$word}++; }
    you could say
    for $w (0 .. $#word) { $count{$word[$w].' '.$word[$w+1] }++; }

    Beware of having an odd number of words in the file; you might pretest and stick a blank in the final (even) entry.

    The code hasn't been tested, but I hope you get the idea.

Re: Dividing a file into groups of two words and counting them
by toolic (Bishop) on May 14, 2008 at 16:16 UTC
    You could modify your loop as follows:
    foreach my $sequence (@bigram) { #my $count; is this a bug? my @word = split ' ', $sequence; my $prev = shift @word; foreach $word (@word) { $count{"${prev}_${word}"}++; $prev = $word; } }

    If "test.txt" contains:

    a b n z d f n z v

    this prints:

    n_z occurs 2 times a_b occurs 1 times d_f occurs 1 times f_n occurs 1 times b_n occurs 1 times

    Is this what you are looking for?

    Note that my $count; seems to be a bug.

    Update: Another solution would be to try to use List::MoreUtils qw(natatime);

Re: Dividing a file into groups of two words and counting them
by johngg (Canon) on May 14, 2008 at 20:06 UTC
    I may have misunderstood but I think you want each pair overlapping, i.e "this word and that" has three pairs viz. "this word", "word and" and "and that". If that is the case you can do a a global regex match using a capture for one word followed by non-word characters then a look-ahead assertion for the next word, also with a capture.

    use strict; use warnings; open my $inFH, q{<}, \ <<'END_OF_FILE' or die qq{open: $!\n}; peter piper picked a peck of pickled peppers a peck of pickled peppers peter piper picked if peter piper picked a peck of pickled peppers where's the peck of pickled peppers peter piper picked END_OF_FILE my $textFile = do { local $/; <$inFH>; }; close $inFH or die qq{close: $!\n}; my $rxWordPair = qr {(?x) ([\w'-]+) \W+ (?=([\w'-]+)) }; my %pairFrequencies; while ( $textFile =~ m{$rxWordPair}g ) { $pairFrequencies{ qq{$1 $2} } ++; } print map { qq{$_: $pairFrequencies{ $_ }\n} } sort { $pairFrequencies{$b} <=> $pairFrequencies{$a} || $a cmp $b } keys %pairFrequencies;

    This produces

    of pickled: 4 peck of: 4 peter piper: 4 pickled peppers: 4 piper picked: 4 a peck: 3 peppers peter: 2 picked a: 2 if peter: 1 peppers a: 1 peppers where's: 1 picked if: 1 the peck: 1 where's the: 1

    I hope this is useful.

    Cheers,

    JohnGG

Re: Dividing a file into groups of two words and counting them
by pc88mxer (Vicar) on May 14, 2008 at 16:00 UTC
    Here's a hint... suppose you could read in one word at a time. Then you could use a strategy like this:
    my $last_word; sub process_word { my $word = shift; ...process it... $last_word = $word; } for every word in the file { process_word($word); }
    Now we just need a way to read every word in the file and sent it through the process_word routine. As a first step, why don't you write the code assuming that there is only one word per input line and get that working. Then see what you have to do to modify it when there are multiple words on an input line.
Re: Dividing a file into groups of two words and counting them
by Anonymous Monk on May 14, 2008 at 16:17 UTC
    my %pairs; local $/; my $data = <DATA>; my @words = split /\W+/, lc($data_; $pairs{ $words[$_].' '.$words[$_+1] }++ for 0..$#words-1; print "$pairs{$_}\t$_\n" for sort { $pairs{$b}<=>$pairs{$a} } keys %pairs; __DATA__ To be or not to be, that is the question Ask not what your country can do for you but what you can do for your +country I think therefore I am I drink therefore I am I'm drunk therfore I was Alas twas not to be
Re: Dividing a file into groups of two words and counting them
by locked_user sundialsvc4 (Abbot) on May 15, 2008 at 12:43 UTC

    The main-loop should simply scan the file, split each line, and call a subroutine for each word.

    Let's assume a global scalar $previous_word whose value is of-course initially undef. The subroutine checks this scalar to see if it is defined(). If so, it contains the first word and we've just been given the second, so tally it, then set $previous_word back to undef. Otherwise, store the word in $previous_word and do nothing more.

    Remember that you must account for the last word if the file contains an odd number. You can detect this case, after the loop ends, by checking to see if $previous_word is defined().