in reply to Word Pairs and Lines

The reason I can't go with the first two words only is that I might have a headline like "Nero Burning Rom Released Today" and then another like "New Release of Nero Burning Rom Out." I'd want to pick up and count a matched pair from both.

Re^2: Word Pairs and Lines
by Limbic~Region (Chancellor) on Oct 09, 2004 at 13:47 UTC
    bob,
    You have a hard problem. It is easy for a human to see that those two headlines are related, but a program only does what you tell it. One approach may be:
      For each headline -
    • Create a two-element array holding the first two words and the entire headline
    • Go through all previous full headlines to see whether that word pair has already been seen
    • If yes, increment the count; if no, add it as a new item
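
    The steps above could be sketched roughly as follows. This is only one reading of the idea, written in Python for brevity rather than Perl. The names `words`, `has_adjacent`, and `count_pairs` are made up for illustration, and the check runs in both directions (new pair against old headlines, old pairs against the new headline), which is an assumption added so that bob's two example headlines match.

```python
import re


def words(headline):
    """Lowercase a headline and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", headline.lower())


def has_adjacent(pair, tokens):
    """Return True if the two words of `pair` appear side by side in `tokens`."""
    return any(tuple(tokens[i:i + 2]) == pair for i in range(len(tokens) - 1))


def count_pairs(headlines):
    """Count headlines per leading word pair (hypothetical helper).

    Each stored item keeps the first two words, the full headline's words,
    and a running count, mirroring the two-element array in the list above.
    """
    items = []
    for headline in headlines:
        tokens = words(headline)
        if len(tokens) < 2:
            continue  # nothing to pair up
        pair = tuple(tokens[:2])
        for item in items:
            # Assumption: match in either direction, so word order between
            # headlines does not matter.
            if has_adjacent(pair, item["words"]) or has_adjacent(item["pair"], tokens):
                item["count"] += 1
                break
        else:
            # No earlier headline matched: add this one as a new item.
            items.append({"pair": pair, "words": tokens, "count": 1})
    return {item["pair"]: item["count"] for item in items}


headlines = [
    "Nero Burning Rom Released Today",
    "New Release of Nero Burning Rom Out.",
    "Perl Conference Announced",
]
counts = count_pairs(headlines)
```

    With these three sample headlines, the second one is counted under the "nero burning" pair because that pair appears inside it, even though it starts with different words.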
    The problem is that two unrelated headlines are quite likely to share a pair of words. Another approach is to split each headline into words, sort them, and count how many the two headlines have in common. In any case, you are not going to come up with a foolproof system. If the logic above is what you want and you can't figure it out, let me know and I can whip something up.
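
    The word-overlap idea might look something like this. Again, this is a rough Python sketch under stated assumptions, not a finished solution: it uses set intersection instead of literally sorting and walking two word lists, and the function names `headline_words` and `common_words` are hypothetical.

```python
import re


def headline_words(headline):
    """Return the set of lowercase words in a headline, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9']+", headline.lower()))


def common_words(first, second):
    """Return the sorted list of words that two headlines share."""
    return sorted(headline_words(first) & headline_words(second))


shared = common_words(
    "Nero Burning Rom Released Today",
    "New Release of Nero Burning Rom Out.",
)
# shared is ['burning', 'nero', 'rom'], so the overlap count is 3
```

    Note that "Released" and "Release" do not count as the same word here, and that common words such as "of" or "the" would inflate the overlap between unrelated headlines. Any threshold on the count would have to be tuned against real data.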

    Cheers - L~R