I've been hoping you might come back and clarify your purpose for this.

My best guess is that you want to aggregate the headlines--for the same story but from different sources--together much in the fashion that google news does?

If that is the case, I think that you are barking up the wrong tree with hoping that any purely character-based algorithmic mechanism for comparing strings (like String::Similarity or Text::Levenstein or String::Approx etc.) will work.

For example, all of the above (though I haven't been able to test String::Similarity as it won't compile here) are likely to rate the following two pseudo headlines as being very similar:

Assuming case insensitive comparisons, they are both the same length; contain the same characters; and the few transpositions required to turn one into the other occur in close proximity. But their meanings are entirely different. As any human reader will spot immediately, they are very unlikely to relate to the same story.

Another example. Take a look at some of the 1,714 headlines relating to Google news' current top story. Whilst there are definite points of commonality between the great majority of them--'Bush', 'Zoellick', 'World', 'Bank'--, there are also those which have no discernible commonality at all. Eg. The Economist headlines the story with "Right second time".

Even amongst those that do have several keywords in common, the placement/relationship of those keywords relative to the other words making up the headlines are such that any algorithm that treats the comparison as simple strings of characters is unlikely to do a good job at grouping them.

To be in any way successful, it becomes necessary to break the strings into words and then compare them in terms of the similarity of the words contained rather than their relative placement. Of course, as others have mentioned, there are a whole bunch of common words (stop words) who's presence in a headline will do nothing useful in terms of grouping. They will be so common as to tend to create groupings (increase similarity) where none exists.

And that leads to my suggestion, assuming that this is what you are attempting to do. You need to group the headlines according to commonality of their least common words.

I would break the headlines into words and accumulate an occurrence count for each word across the entire corpus of the day's data (and probably accumulate these counts going forward). I would then 'tag' each headline using the two or three or four least common words it contains.

Headlines would then be clustered according to the number of least common words they have in common.

For example. Today, any headline containing the word 'Zoellick' is quite likely to relate to the new president of the World Bank. If it also contains any of the words 'bush', 'president', 'world' or 'bank' it is very likely to. And if it contains more than two, even more likely.

Of course, that won't help with maverick or cutesy (like the Economist's effort above), or punning headlines (as favoured by the tabloid press), but short of performing a similar 'least common words in common' process on the content of the article, there is little that can be done about that.

The good news is that if you went with this algorithm, you ought to be able to improve your performance markedly as it is a O( 2n ) process, rather than a O( m*n ) process.

Anyway, maybe the thoughts will help. Good luck.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

In reply to Re: Fast string similarity method by BrowserUk
in thread Fast string similarity method by icanwin

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.