I have a database of links: for each one, the href, the link text, and the 255 characters of text surrounding the link on either side.
I want to use this data for clustering: given a document, I want to use the text of the links pointing to it to return a (short) list of related documents, if any.
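To make the goal concrete, here is a rough sketch of the kind of lookup I mean. It is only an illustration under some assumptions: the sample rows, the `build_index` and `related` names, and the `min_score` cutoff are all made up, it uses only the link text (not the surrounding 255 characters), and TF-IDF with cosine similarity is just one plausible way to score "related".

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical sample rows: (target URL, link text).
# The surrounding-text column is left out of this sketch.
LINKS = [
    ("docs/kmeans.html", "k-means clustering guide"),
    ("docs/kmeans.html", "clustering tutorial"),
    ("docs/hier.html", "hierarchical clustering tutorial"),
    ("docs/pasta.html", "easy pasta recipe"),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(links):
    """Return (unit-length TF-IDF vectors per target, inverted index)."""
    # Pool the text of every link that points at the same target document.
    counts = defaultdict(Counter)
    for target, link_text in links:
        counts[target].update(tokenize(link_text))

    # Document frequency: how many targets each term is used to describe.
    df = Counter(term for tf in counts.values() for term in tf)
    n_docs = len(counts)

    vectors = {}
    index = defaultdict(dict)  # term -> {target: normalized weight}
    for doc, tf in counts.items():
        # Smoothed IDF so that common terms are down-weighted but never zero.
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors[doc] = {t: w / norm for t, w in vec.items()}
        for t, w in vectors[doc].items():
            index[t][doc] = w
    return vectors, index


def related(doc, vectors, index, top_n=5, min_score=0.1):
    """Return up to top_n (target, cosine similarity) pairs for doc."""
    scores = defaultdict(float)
    # Only targets sharing at least one term with doc are ever scored.
    for term, weight in vectors.get(doc, {}).items():
        for other, other_weight in index[term].items():
            if other != doc:
                scores[other] += weight * other_weight
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(d, s) for d, s in ranked if s >= min_score][:top_n]


vectors, index = build_index(LINKS)
print(related("docs/kmeans.html", vectors, index))
```

Because the vectors are normalized, summing weight products through the inverted index gives the cosine similarity directly, and documents with no terms in common are never touched. A document with no sufficiently similar neighbours gets an empty list back, which is the "if any" case.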
I can't quite get it to scale or work the way I want it to.
Maybe this is a little open-ended, but does anyone know what approach I should be taking?