in reply to Word Pairs and Lines

Trying to rethink this... First, these aren't sentences. They're lists of headlines -- so phrases, each ending with a hard return.

Second, the first script I posted above works just fine for listing the various word pairs and their frequencies. So that's not a problem. The PROBLEM I'm having is massive redundancy. Below is a short example of the word pairs found and their frequencies, taken from the output of the script above.

OPEN SOURCE 9
WINDOWS XP 8
NERO BURNING 7
BURNING ROM 7
FLAW FOUND 6

Pairs 3 and 4 refer to the same headline, something like "Nero Burning ROM." I'd like the script to produce only one pair for each headline, so that once "Nero Burning" is output, "Burning ROM" is recognized as redundant and deleted.

Now there may be an easier way to do this than what I asked for above. As I said, my thinking maybe wasn't straight enough. Possibly a second script, which takes the output file, wordburst.txt, and removes every pair containing a word that already appeared in a previous pair. I've tried to formulate a regex to do this, but no luck...
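
For illustration, here is a minimal sketch of that second-pass filter, written in Python rather than Perl. It assumes each line of wordburst.txt has exactly the form "WORD1 WORD2 COUNT" and that lines are already sorted by descending count, as in the sample above. The function name drop_redundant_pairs is made up for this example. It reads "previous pair" literally, so a word counts as seen even when the pair it came from was itself dropped.

```python
def drop_redundant_pairs(lines):
    """Keep a pair only if neither of its words occurred in an earlier pair."""
    seen = set()   # every word met so far, whether its pair was kept or not
    kept = []
    for line in lines:
        parts = line.split()
        # Assumed format: two words followed by an integer count.
        if len(parts) != 3 or not parts[2].isdigit():
            continue  # skip blank or malformed lines
        first, second, count = parts
        if first not in seen and second not in seen:
            kept.append((first, second, int(count)))
        seen.update((first, second))
    return kept


sample = """OPEN SOURCE 9
WINDOWS XP 8
NERO BURNING 7
BURNING ROM 7
FLAW FOUND 6""".splitlines()

for first, second, count in drop_redundant_pairs(sample):
    print(first, second, count)
```

On the five sample lines this keeps everything except BURNING ROM. One caveat with the approach itself: it cannot tell which headline a pair came from, so it would also drop a pair from an unrelated headline that happens to share a word with an earlier pair (say, a later WINDOWS UPDATE after WINDOWS XP). A plain regex is awkward here because the test depends on state built up over earlier lines, which is why a set (or a hash in Perl) is the more natural tool.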

Re^2: Word Pairs and Lines
by TedPride (Priest) on Oct 09, 2004 at 08:37 UTC
    Hmm. So what you want is the first word pair in each sentence - but a count for that pair across all sentences?