foomatic99 has asked for the wisdom of the Perl Monks concerning the following question:

I have a database of href links, each stored with its link text and the 255 characters of text surrounding the link on either side.

I want to use this data to do clustering: given a document, I want to return a (short) list of related documents, if any, based on the text of the links pointing to it.

I can't quite get it to scale or work the way I want it to.

Maybe this is, um, a little open-ended but does anyone know what I should be doing?

Replies are listed 'Best First'.
Re: document clustering via link contexts
by Fletch (Bishop) on Apr 11, 2007 at 15:44 UTC

    The approach that comes to mind would be doing some sort of LSI (latent semantic indexing) / vector space search on the words in the surrounding text and relating the URLs to each other using that. Maybe this perl.com article and the references it gives will be of help.
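
To make the vector space idea concrete, here is a minimal sketch in Python (the thread itself has no code, so the language, the function names, and the row layout of (target URL, link text, context) are all assumptions for illustration). It pools the link text and surrounding context for each target URL into one bag of words, weights terms by TF-IDF, and ranks other URLs by cosine similarity. This is plain vector space search, not LSI; LSI would add a dimensionality-reduction step (an SVD of the term-document matrix) on top of it.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vectors(rows):
    """Build one unit-length TF-IDF vector per target URL.

    rows: iterable of (target_url, link_text, context) tuples
          (a hypothetical layout for the database described above).
    Returns {url: {term: weight}}.
    """
    counts = defaultdict(Counter)
    for url, link_text, context in rows:
        # Every link pointing at the same URL contributes to that URL's bag.
        counts[url].update(tokenize(link_text + " " + context))

    n_docs = len(counts)
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())

    vectors = {}
    for url, term_counts in counts.items():
        vec = {t: tf * math.log(n_docs / doc_freq[t])
               for t, tf in term_counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # Normalize so that a dot product equals the cosine similarity.
        vectors[url] = {t: w / norm for t, w in vec.items()} if norm else {}
    return vectors


def related(url, vectors, k=5, min_score=0.1):
    """Return up to k (url, score) pairs most similar to the given URL."""
    query = vectors.get(url, {})
    scored = []
    for other, vec in vectors.items():
        if other == url:
            continue
        score = sum(w * vec.get(t, 0.0) for t, w in query.items())
        if score >= min_score:
            scored.append((score, other))
    scored.sort(reverse=True)
    return [(other, round(score, 3)) for score, other in scored[:k]]
```

The min_score cutoff (an arbitrary value here) is what lets the function return an empty list when nothing is really related. Note that related() compares the query against every other URL, so it will not scale to a large database as written; an inverted index from term to URLs, so that only documents sharing at least one term are scored, would be one way to address that.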