foomatic99 has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to guess the correct titles of news articles from the text of links pointing to the article page. Often the link text is exactly the article's title, and I can estimate with fairly good precision how often a given site has used the correct title as link text in the past. This matters because many sites instead link to articles using the name of the news organization, which is a likely false positive when looking at a news article's HTML.

Example:

Site A: ...I was just reading that according to the <a href="...">New York Times</a> "Circuit City's Job Cuts Backfiring, Analysts Say" and it seems to me that...
Site B: ...I came across this article today, <a href="...">Circuit City's Job Cuts Backfiring, Analysts Say</a> in the New York Times so then I thought...

This example is somewhat artificial, since NYTimes.com has good SEO and I wouldn't need to resort to this method to parse its titles. However, precisely because its titles are easy to get, I can use it to measure how often each site has used each linking convention in the past.

There's also the small problem that occasionally none of the links I know about will use the article title as link text.

My first instinct is that the probability that a given string is the correct article title, based on incoming links, is some product of the number of incoming links with that link text and the past probability that the sites behind those links used the article title as link text. However, the number of incoming links varies, and if I multiply the per-site probabilities together, the result can only decrease with each additional link carrying the "correct" link text after the first one. That doesn't seem right: more agreeing links should make me more confident, not less.
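To show the decrease I mean, here is a minimal sketch (in Python, purely for illustration) of one reading of that instinct, where the past probabilities of the linking sites are simply multiplied together. The function name `naive_title_score` and the three rates are made up for the example and are not real measurements.

```python
from math import prod

def naive_title_score(site_probs):
    """Multiply the past title-linking rates of every site that used this link text."""
    return prod(site_probs)

# Hypothetical past rates at which three sites used the real title as link text.
rates = [0.9, 0.8, 0.7]

# Score after seeing 1, 2, then 3 links that all agree on the same link text.
scores = [naive_title_score(rates[:n]) for n in range(1, len(rates) + 1)]
for n, score in enumerate(scores, start=1):
    print(f"{n} agreeing link(s): {score:.3f}")
```

With these made-up rates the score falls from 0.9 with one link to roughly 0.5 with three, even though every extra link agrees with the first.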

I feel like there must be some formula for getting this right that eludes me. Can someone with a bit more literacy in probability point me to it?