Word Pairs and Lines

bob has asked for the wisdom of the Perl Monks concerning the following question:

This script, originally from http://www-2.cs.cmu.edu/~rosie/ and modified a bit by me, finds word pairs in a file and counts them. What I want it to do is find ONLY a single word pair from each line in the file it looks at. So the line "Joe works hard" will only return "Joe works" as a word pair, and not also return "works hard." In other words, I only want it to find one pair per line. Heh! I'm no Perl guru, so keep it as simple as possible. Thank you in advance for your help. Any ideas?

#!/usr/bin/perl -w

print "Content-type: text/html\n\n";

$databasefile = "/home/virtual/admin16/var/www/cgi-bin/DB_Search/Data_files/top.xml";
$wordfile = "/home/virtual/admin16/var/www/html/news/wordburst.txt";

my $file = $databasefile;

open (FILE,$file) || die "Cannot read from $file";
flock(FILE, 2); # Locking file

open (OUTFILE, ">$wordfile") || die "error opening $wordfile $!\n";

$lastword = "BEGINNING_OF_TEXT";
$wordcounts{$lastword}++;
while (<FILE>) {
    # strip punctuation from the line before splitting it into words
    s/[^\w\s]//g;
    foreach $word (split /\s+/) {
        # we only want to deal with normal words:
        # remove any remaining non-word characters
        $word =~ s/\W//g;
        # add one to the count of each word in this file
        # the curly brackets mean an associative array,
        # indexed by the word name
        $wordcounts{$word}++;
        $totalwords++;
        # we can make an associative array on the pair of words
        # if it's the first time we've seen this pair,
        # record how to split it back into two words
        $word_pair_counts{"$lastword,$word"}++
            or
            $word_pair_split{"$lastword,$word"} = [ $lastword, $word ];
        # now remember what word we saw last for the next pair
        $lastword = $word;
    }
}


# now look at the most frequent word pairs

 
 $word_pairs_printed = 0;
 foreach  (sort  { $word_pair_counts{$b} <=> $word_pair_counts{$a} } keys %word_pair_counts) {
     ($word1, $word2) = split(/,/);
     
       
        printf OUTFILE ("\ $word1 $word2 $word_pair_counts{$_}\n");

          $word_pairs_printed++;
     # last ends the loop
     last if ($word_pairs_printed > 39);
 }
    
    flock(FILE, 8);                # Unlocking file

    close(OUTFILE);

Replies are listed 'Best First'.
Re: Word Pairs and Lines by ikegami (Patriarch) on Oct 08, 2004 at 20:07 UTC
How about something like:

my ($word1, $word2) = split(/\s+/);
if (defined($word2)) { ... }

instead of:

foreach $word (split /\s+/) { ... }

In other words, don't loop over every word. Get the first two words of every line, and work with those.
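A minimal sketch of that suggestion, counting only the first pair of each line. The helper name `first_pair_counts` and the sample lines are made up for illustration. It uses `split ' '` rather than `split /\s+/` so that leading whitespace does not produce an empty first field, and it strips punctuation the same way the original script does.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# first_pair_counts() is a hypothetical helper, not part of the original script.
# It counts only the first word pair of each line it is given.
sub first_pair_counts {
    my @lines = @_;
    my %count;
    for my $line (@lines) {
        (my $clean = $line) =~ s/[^\w\s]//g;        # strip punctuation
        my ($word1, $word2) = split ' ', $clean;    # first two words only
        next unless defined $word2;                 # skip lines with fewer than two words
        $count{"$word1 $word2"}++;
    }
    return \%count;
}

# Sample input (invented for this sketch)
my $counts = first_pair_counts(
    "Joe works hard",
    "Joe works late",
    "Solo",
);

# Most frequent pairs first; ties broken alphabetically
for my $pair (sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b } keys %$counts) {
    print "$pair $counts->{$pair}\n";    # prints "Joe works 2"
}
```

Note that this only ever sees the first two words of a line, so a pair that occurs later in a line is never counted.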
Re: Word Pairs and Lines by jeffa (Bishop) on Oct 08, 2004 at 20:24 UTC
Interesting ... I had to waste some time on this one. ;) Now then, if you do perform a `$word =~ s/\W//g;` on the entire sentence, how do you know when the sentence ends? You have to keep some punctuation around. Anyways, try this out. Hopefully some other monks will have better answers for you, but this is a simple approach.

#!/usr/bin/perl -l
use strict;
use warnings;
use Data::Dumper;

my $data = do {local $/;<DATA>};
my @sent = split /[.!?]\B/,$data;
my @parsed;

for my $i (0 .. $#sent) {
    next if $sent[$i] =~ /^$/;
    my @word = map $_ || (), split /\s+/,$sent[$i];
    for (my $j = 0; $j < @word; $j += 2) {
        push @{ $parsed[$i] }, [ $word[$j], $word[$j+1] ];
    }
}

print Dumper \@parsed;

# second pair from second sentence (should be 'a test')
print join ' ', @{$parsed[1]->[1]};

__DATA__
This is Joe. This is a test. This is not a test. Blah blah isn't this fun? I wish you were here! Nah ...

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)
Re: Word Pairs and Lines by TedPride (Priest) on Oct 08, 2004 at 22:17 UTC
I'm assuming you want to count how many times each word pair occurs in the document, and rank the pairs by number found. I'm also assuming that a line break is always the end of a sentence, and that you want periods to signify the end of a sentence.

use strict;

my (@lines, @words, $i, $pair, %hash);

foreach (<DATA>) {
    $_ =~ s/[^\w\. ]//g;              # Remove unneeded characters
    $_ =~ s/ +/ /g;                   # Many spaces to one
    $_ =~ s/ ?\. ?(\. ?)*/\./g;       # Boundaries with . change to .
    $_ =~ s/^ //; $_ =~ s/ $//;       # Spaces at start and end removed
    $_ = lc($_);                      # Lowercase
    @lines = split(/\./, $_);         # Split on sentence boundaries
    foreach (@lines) {                # Get words for each sentence
        @words = split(/[\. ]/, $_);
        for ($i = 0; $i < $#words; $i++) {   # For each word pair
            $pair = $words[$i] . ' ' . $words[$i+1];
            $hash{$pair}++;           # Increment count for word pair
        }
    }
}

foreach (sort {$hash{$b} <=> $hash{$a}} keys %hash) {
    print $_ . ' ' . $hash{$_} . "\n";
}

__DATA__
Four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in liberty, and dedicated to the proposition that "all men are created equal"
Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived, and so dedicated, can long endure. We are met on a great battle field of that war. We have come to dedicate a portion of it, as a final resting place for those who died here, that the nation might live. This we may, in all propriety do. But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow, this ground -- The brave men, living and dead, who struggled here, have hallowed it, far above our poor power to add or detract. The world will little note, nor long remember what we say here; while it can never forget what they did here.
It is rather for us, the living, we here be dedicated to the great task remaining before us -- that, from these honored dead we take increased devotion to that cause for which they here, gave the last full measure of devotion -- that we here highly resolve these dead shall not have died in vain; that the nation, shall have a new birth of freedom, and that government of the people by the people for the people, shall not perish from the earth.
Re: Word Pairs and Lines by bob (Novice) on Oct 09, 2004 at 03:29 UTC
Trying to rethink this...

First, these aren't sentences. They're lists of headlines -- so phrases, each ending with a hard return.

Second, the first script I posted above works just fine in listing the various word pairs and their frequency. So that's not a problem.

The PROBLEM I'm having is massive redundancy. Below is a short example of the word pairs found, and their frequency - output from the script above.

OPEN SOURCE 9
WINDOWS XP 8
NERO BURNING 7
BURNING ROM 7
FLAW FOUND 6

Pairs 3 and 4 refer to the same headline. It's something like "Nero Burning ROM." I'd like the script to produce only one pair for each headline, so that once "Nero Burning" is output, "Burning Rom" is recognized as redundant and deleted.

Now there may be an easier way to do this than what I asked for above. As I said, my thinking maybe wasn't straight enough. Possibly a second script, which takes the output file, wordburst.txt, and removes every pair containing a word that appeared in a previous pair. I've tried to formulate a regex to do this, but no luck.....
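The "second script" idea can be sketched roughly as below. This assumes input lines of the form "WORD1 WORD2 COUNT", already sorted by count as in the sample output; the helper name `drop_redundant_pairs` is made up. It also assumes that words from dropped pairs should be remembered too, which is only one possible reading of the request.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# drop_redundant_pairs() is a hypothetical helper. It keeps a pair only if
# neither of its words has appeared in any earlier pair (kept or dropped).
sub drop_redundant_pairs {
    my @ranked = @_;
    my (%seen, @kept);
    for my $line (@ranked) {
        my ($word1, $word2, $count) = split ' ', $line;
        next unless defined $count;                 # skip malformed lines
        my $redundant = $seen{lc $word1} || $seen{lc $word2};
        for my $word ($word1, $word2) {
            $seen{lc $word} = 1;                    # remember both words either way
        }
        push @kept, $line unless $redundant;
    }
    return @kept;
}

# The sample output from the post above
my @kept = drop_redundant_pairs(
    "OPEN SOURCE 9",
    "WINDOWS XP 8",
    "NERO BURNING 7",
    "BURNING ROM 7",
    "FLAW FOUND 6",
);
print "$_\n" for @kept;    # "BURNING ROM 7" is dropped, the other four remain
```

One caveat of this approach: it cannot tell a redundant pair from an unrelated pair that merely shares a word, so a second pair beginning with "OPEN" would also be dropped after "OPEN SOURCE".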
Re^2: Word Pairs and Lines by TedPride (Priest) on Oct 09, 2004 at 08:37 UTC
Hmm. So what you want is the first word pair in each sentence - but a count for that pair across all sentences?
Re: Word Pairs and Lines by bob (Novice) on Oct 08, 2004 at 21:00 UTC
I see my question was poorly formulated. I want the highest frequency pairs to float to the top, so the first two words won't do. Let me come back to you with a better way to put this.... Might be that the whole doc will have to be parsed as in the original script, then the results worked over...
Re: Word Pairs and Lines by The_Rabbit (Acolyte) on Oct 08, 2004 at 20:02 UTC
> What I want it to do is find ONLY a single word pair from each line in the file it looks at. So the line "Joe works hard" will only return "Joe works" as a word pair, and not also return "works hard." In other words, I only want it to find one pair per line.

I'm sort of confused by this statement. Are there any criteria for selecting a word pair? Or do you always want to select the first two words on a line as the word pair?
Re: Word Pairs and Lines by bob (Novice) on Oct 09, 2004 at 03:44 UTC
Just to continue a bit: if I had a headline like "Nero Burning Rom Released Today," I'd wind up with:

Nero Burning
Burning Rom
Rom Released
Released Today

And all except the first would be deleted because they are redundant with the first, or with subsequent redundancies...???? (Hmmmm.... might want it to work backwards... deleting the first initially.... So redundancies would be matched....??? Not sure that's necessary...)
Re: Word Pairs and Lines by bob (Novice) on Oct 09, 2004 at 04:07 UTC
The reason I can't go with the first two words only is that I might have a headline like "Nero Burning Rom Released Today" and then another like "New Release of Nero Burning Rom Out." I'd want to pick up and count a matched pair from both.
Re^2: Word Pairs and Lines by Limbic~Region (Chancellor) on Oct 09, 2004 at 13:47 UTC
bob,
You have a hard problem. It is easy for a human to see that those two headlines are related, but a program only does what you tell it. One approach may be:

For each headline:
- Create a 2-element array of the first two words and the entire headline
- Go through all previous full headlines to see if it has been seen already
- If yes, increment the count; if no, add it as a new item

The problem is that there is likely a high probability that two words will be present in two different headlines that are not related. Other approaches might be to split out the words, sort them, and look for the total number in common. In any case, you are not going to come up with a foolproof system. If the logic above is what you want and you can't figure it out, let me know and I can whip up something.

Cheers - L~R
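One possible reading of those steps, as a rough sketch rather than L~R's actual code: each stored item keeps a headline's first pair, and a later headline that contains an already stored pair bumps that item's count instead of adding a new item. The helper name `count_headline_pairs`, the lowercasing, and the punctuation stripping are assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# count_headline_pairs() is a hypothetical helper illustrating one
# interpretation of the approach described above.
sub count_headline_pairs {
    my @headlines = @_;
    my @items;    # each item: { pair => "word1 word2", headline => ..., count => n }
    HEADLINE: for my $headline (@headlines) {
        (my $clean = lc $headline) =~ s/[^\w\s]//g;   # lowercase, strip punctuation
        my @words = split ' ', $clean;
        next if @words < 2;
        my $normalized = join ' ', @words;
        # Has a previously stored pair been seen in this headline?
        for my $item (@items) {
            if ($normalized =~ /\b\Q$item->{pair}\E\b/) {
                $item->{count}++;
                next HEADLINE;
            }
        }
        # Not seen before: store the first two words and the full headline
        push @items, { pair => "$words[0] $words[1]", headline => $headline, count => 1 };
    }
    return @items;
}

my @items = count_headline_pairs(
    "Nero Burning Rom Released Today",
    "New Release of Nero Burning Rom Out",
    "Flaw Found in Windows XP",
);
for my $item (sort { $b->{count} <=> $a->{count} } @items) {
    print "$item->{pair} $item->{count}\n";
}
```

The result depends on headline order: had the "New Release of ..." headline come first, its pair "new release" would have been stored and the two Nero headlines would not have been matched. That is one concrete way in which the system is not foolproof.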
Re: Word Pairs and Lines by bob (Novice) on Oct 09, 2004 at 23:09 UTC
Oh Ted--you said that too... Sorry.. I missed your post
Re: Word Pairs and Lines by bob (Novice) on Oct 09, 2004 at 15:09 UTC
L~R, if I'm following you, your suggestion may be the closest yet to what I want. I'll think about that. Not sure about the mechanics.... Is this what you mean? Take the first two words, count the frequency of each such word pair against the headlines.... (BTW, this would be done after removing "stopwords": and, the, a, and the rest of a long list.)
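A sketch of that reading, under stated assumptions: the stopword list here is a tiny placeholder, the helper names `content_words` and `pair_frequencies` are made up, and "count against the headlines" is taken to mean counting how many stopword-free headlines contain the pair as adjacent words.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
my %stopword = map { $_ => 1 } qw(a an and the of in);

# Lowercase a headline, strip punctuation, and drop stopwords.
sub content_words {
    my ($headline) = @_;
    (my $clean = lc $headline) =~ s/[^\w\s]//g;
    return grep { !$stopword{$_} } split ' ', $clean;
}

# For each headline, take the first two remaining words, then count how many
# headlines contain that pair as adjacent words.
sub pair_frequencies {
    my @headlines = map { join(' ', content_words($_)) } @_;
    my %freq;
    for my $headline (@headlines) {
        my ($word1, $word2) = split ' ', $headline;
        next unless defined $word2;
        my $pair = "$word1 $word2";
        next if exists $freq{$pair};                 # already counted
        $freq{$pair} = grep { /\b\Q$pair\E\b/ } @headlines;   # grep in scalar context counts matches
    }
    return \%freq;
}

my $freq = pair_frequencies(
    "Nero Burning Rom Released Today",
    "New Release of Nero Burning Rom Out",
    "The Flaw Found in Windows XP",
);
for my $pair (sort { $freq->{$b} <=> $freq->{$a} || $a cmp $b } keys %$freq) {
    print "$pair $freq->{$pair}\n";
}
```

With these three sample headlines, "nero burning" is counted twice, while "flaw found" and "new release" are counted once each. Pairs such as "new release" still show up as separate entries, so some later filtering would probably be needed.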