in reply to Fast string similarity method
I've been hoping you might come back and clarify your purpose for this.
My best guess is that you want to group the headlines--for the same story but from different sources--together, much in the fashion that Google News does?
If that is the case, I think that you are barking up the wrong tree in hoping that any purely character-based algorithmic mechanism for comparing strings (like String::Similarity or Text::Levenshtein or String::Approx etc.) will work.
For example, all of the above (though I haven't been able to test String::Similarity as it won't compile here) are likely to rate the following two pseudo headlines as being very similar:
Assuming case-insensitive comparisons, they are both the same length; contain the same characters; and the few transpositions required to turn one into the other occur in close proximity. But their meanings are entirely different. As any human reader will spot immediately, they are very unlikely to relate to the same story.
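To illustrate the pitfall (the original example headlines weren't preserved in this post, so the pair below is made up), here is a quick sketch using Python's stdlib SequenceMatcher as a stand-in for the Perl modules named above--any character-level measure behaves similarly:

```python
from difflib import SequenceMatcher

# Hypothetical headline pair: same length, same characters, a couple of
# nearby transpositions -- but the meanings are reversed.
a = "man bites dog in city park"
b = "dog bites man in city park"

ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
print(ratio)  # well above 0.7, despite the opposite meaning
```

A character-based score this high would group the two stories together, which is exactly the failure mode described above.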
Another example: take a look at some of the 1,714 headlines relating to Google News' current top story. Whilst there are definite points of commonality between the great majority of them--'Bush', 'Zoellick', 'World', 'Bank'--there are also those which have no discernible commonality at all. E.g. The Economist headlines the story with "Right second time".
Even amongst those that do have several keywords in common, the placement/relationship of those keywords relative to the other words making up the headlines are such that any algorithm that treats the comparison as simple strings of characters is unlikely to do a good job at grouping them.
To be in any way successful, it becomes necessary to break the strings into words and then compare them in terms of the similarity of the words contained, rather than their relative placement. Of course, as others have mentioned, there are a whole bunch of common words (stop words) whose presence in a headline will do nothing useful in terms of grouping. They will be so common as to tend to create groupings (increase similarity) where none exist.
And that leads to my suggestion, assuming that this is what you are attempting to do. You need to group the headlines according to commonality of their least common words.
I would break the headlines into words and accumulate an occurrence count for each word across the entire corpus of the day's data (and probably accumulate these counts going forward). I would then 'tag' each headline using the two or three or four least common words it contains.
Headlines would then be clustered according to the number of least common words they have in common.
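The three steps above--count word occurrences across the corpus, tag each headline with its least common words, then cluster on shared tags--can be sketched as follows. The headlines and the corpus-wide counts here are invented for illustration (in practice the counts would be accumulated from the whole day's data, so discriminating words like 'Zoellick' end up rare and stop words end up common):

```python
from collections import Counter
from itertools import combinations
import re

# Hypothetical headlines -- not from the original post.
headlines = [
    "Bush nominates Zoellick to head World Bank",
    "Zoellick named as new World Bank president",
    "World Bank gets Zoellick as Wolfowitz successor",
    "Oil prices rise on supply fears",
]

# Pretend corpus-wide occurrence counts; in reality these are accumulated
# by counting every word across all of the day's headlines.
counts = Counter({
    "to": 1000, "as": 900, "on": 800, "new": 500, "world": 300,
    "bank": 250, "president": 200, "bush": 150, "oil": 120,
    "prices": 110, "rise": 90, "head": 80, "supply": 70, "fears": 60,
    "gets": 50, "named": 40, "nominates": 30, "successor": 25,
    "wolfowitz": 5, "zoellick": 2,
})

def words(headline):
    return set(re.findall(r"[a-z']+", headline.lower()))

def tags(headline, n=3):
    """The n least common words in the headline, per the corpus counts."""
    return set(sorted(words(headline), key=lambda w: counts[w])[:n])

# Link any two headlines that share at least one least-common-word tag.
links = [
    (i, j)
    for (i, a), (j, b) in combinations(enumerate(headlines), 2)
    if tags(a) & tags(b)
]
print(links)  # the three Zoellick headlines link; the oil story stands alone
```

Note that stop words fall out of the scheme for free: because they are the most common words in the corpus, they can never be among a headline's least common words, so no explicit stop-word list is needed.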
For example: today, any headline containing the word 'Zoellick' is quite likely to relate to the new president of the World Bank. If it also contains any of the words 'bush', 'president', 'world' or 'bank', it is very likely to. And if it contains more than two of them, more likely still.
Of course, that won't help with maverick or cutesy headlines (like the Economist's effort above), or punning ones (as favoured by the tabloid press), but short of performing a similar 'least common words in common' process on the content of the articles themselves, there is little that can be done about that.
The good news is that if you went with this algorithm, you ought to be able to improve your performance markedly, as it is an O(2n) process (two linear passes: one to count words, one to tag) rather than an O(m*n) pairwise comparison of every headline against every other.
Anyway, maybe the thoughts will help. Good luck.