foomatic99 has asked for the wisdom of the Perl Monks concerning the following question:
I have a database of href links, each stored with its link text and the 255 characters of text on either side of the link.
I want to use this data to do clustering -- given a document, I want to use the text of the links pointing to it to return a (short) list of related documents, if any.
I can't quite get it to scale or work the way I want it to.
Maybe this is, um, a little open-ended but does anyone know what I should be doing?
One approach that comes to mind is some form of LSI or vector space search over the words in the surrounding text, relating the URLs by how similar those words are. This perl.com article and the references it gives may be of help.
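To make the vector space idea concrete, here is a minimal sketch in Python (the same steps map directly onto Perl hashes). It is plain TF-IDF with cosine similarity, not LSI proper, since it skips the SVD step. Everything named here is an assumption for illustration: the `contexts` layout (URL mapped to a list of link-text and surrounding-text snippets), the function names, and the example URLs are all hypothetical, and your database schema will differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vectors(contexts):
    """Turn {url: [snippet, ...]} into {url: {term: weight}} unit-length TF-IDF vectors.

    `contexts` is a hypothetical layout: every snippet (link text or
    surrounding text) that points at a URL is pooled into one bag of words.
    """
    counts = {
        url: Counter(tok for snippet in snippets for tok in tokenize(snippet))
        for url, snippets in contexts.items()
    }
    # Document frequency: in how many URLs' bags each term appears.
    df = Counter()
    for bag in counts.values():
        df.update(bag.keys())
    n = len(counts)

    vectors = {}
    for url, bag in counts.items():
        # Smoothed IDF; a term present in every bag gets weight 0.
        vec = {t: tf * math.log((1 + n) / (1 + df[t])) for t, tf in bag.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors[url] = {t: w / norm for t, w in vec.items()} if norm else {}
    return vectors


def related(url, vectors, k=5, min_score=0.0):
    """Return up to k (other_url, cosine_score) pairs, best first.

    Vectors are unit length, so the dot product is the cosine similarity.
    URLs scoring at or below min_score are dropped, so the list may be empty.
    """
    query = vectors[url]
    scored = []
    for other, vec in vectors.items():
        if other == url:
            continue
        score = sum(w * vec.get(t, 0.0) for t, w in query.items())
        if score > min_score:
            scored.append((score, other))
    scored.sort(reverse=True)
    return [(other, score) for score, other in scored[:k]]


# Made-up example data, only to show the call shape.
contexts = {
    "http://example.com/perl-regex": [
        "perl regular expressions tutorial",
        "learn regex matching with perl",
    ],
    "http://example.com/regex-ref": [
        "regular expressions reference",
        "regex syntax quick reference",
    ],
    "http://example.com/gardening": ["growing tomatoes in a small garden"],
}
vectors = build_vectors(contexts)
print(related("http://example.com/perl-regex", vectors))
```

This version compares the query against every other URL, which will not scale to a large database. A likely next step, under the same assumptions, is an inverted index from term to URLs so that only documents sharing at least one term are scored, plus a stop-word list to keep common words out of the vectors.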