bob,
You have a hard problem. It is easy for a human to see that those two headlines are related, but a program only does what you tell it. One approach may be:
For each headline -
- Create a 2 element array of first two words and entire headline
- Go through all previous full headlines to see if it has been seen already
- If yes - increment the count, if no - add it as a new item
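The steps above could be sketched roughly like this. This is only an illustration in Python, and it assumes that "seen already" means an earlier headline starts with the same two words, compared case-insensitively. The function name and the sample headlines are made up for the example.

```python
def group_by_first_two_words(headlines):
    """Count headlines that share their first two words (hypothetical helper)."""
    groups = []  # each item is [first_two_words, first_full_headline, count]
    for headline in headlines:
        # Assumption: the key is the first two words, lowercased.
        key = " ".join(headline.lower().split()[:2])
        for item in groups:
            if item[0] == key:
                item[2] += 1  # seen already: increment the count
                break
        else:
            # not seen: add it as a new item
            groups.append([key, headline, 1])
    return groups


if __name__ == "__main__":
    sample = [
        "Storm hits coast",
        "Storm hits the coast hard",
        "Markets rally",
    ]
    for key, headline, count in group_by_first_two_words(sample):
        print(count, headline)
```

The linear scan over previous items mirrors the steps as written; a dictionary keyed on the first two words would do the same job with less searching.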
The problem is that two unrelated headlines may well start with the same two words. Another approach might be to split each headline into words, sort them, and count how many words the two headlines have in common. In any case, you are not going to come up with a foolproof system. If the logic above is what you want and you can't figure it out, let me know and I can whip something up.
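For the count-the-common-words idea, here is a minimal sketch, again in Python. It uses sets rather than sorted lists, which gives the same count of distinct shared words. The function name is hypothetical, and any cutoff for what counts as "related" is something you would have to pick and tune yourself.

```python
def shared_word_count(headline_a, headline_b):
    """Return how many distinct words two headlines share (hypothetical helper)."""
    # Assumption: words are whitespace-separated and compared case-insensitively;
    # punctuation is not stripped.
    words_a = set(headline_a.lower().split())
    words_b = set(headline_b.lower().split())
    return len(words_a & words_b)


if __name__ == "__main__":
    print(shared_word_count("Storm hits coast", "Coast hit by storm"))
```

Note that "hits" and "hit" do not match here, which is one more reason no simple rule will be foolproof.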