in reply to word similarity measure

You need to define what you mean by "similarity".

At first glance, words 1, 2, and 4 are 'similar' since they each have the same number of sub-components. A second glance reveals that words 1, 2, and 3 are 'similar': they each contain '101'. And words 2 and 4 are 'similar' because they are the only words that contain 148 and 131.

I suspect that once you have defined your terms, you will be able to write a function that takes two words and returns the degree of 'similarity' between them. Once you have all of the pair-wise ratings computed, sort() will let you rank the papers from most alike to least.
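To make that concrete, here is a minimal sketch of one possible definition, not necessarily the one you want. It assumes each word is a string of sub-components joined by '.', and it scores a pair by Jaccard overlap (shared sub-components divided by all distinct sub-components). The sample words, the names w1 to w4, the separator, and the functions similarity() and rank_pairs() are all invented for illustration; I only picked values that match the observations above.

```python
from itertools import combinations

def similarity(word_a, word_b, sep="."):
    """Jaccard similarity of the two words' sub-component sets (0.0 to 1.0)."""
    parts_a = set(word_a.split(sep))
    parts_b = set(word_b.split(sep))
    return len(parts_a & parts_b) / len(parts_a | parts_b)

def rank_pairs(words):
    """Score every pair of named words and return them most alike first."""
    scored = [
        (similarity(words[a], words[b]), a, b)
        for a, b in combinations(sorted(words), 2)
    ]
    # Highest score first; ties are broken by name so the order is stable.
    return sorted(scored, key=lambda t: (-t[0], t[1], t[2]))

# Hypothetical data, made up to fit the description in this post.
words = {
    "w1": "101.202.303",
    "w2": "101.148.131",
    "w3": "101.555",
    "w4": "148.131.777",
}

for score, a, b in rank_pairs(words):
    print(f"{a} vs {b}: {score:.2f}")
```

With this definition w2 and w4 come out on top, since they share two of their four distinct sub-components. If the number of sub-components or their order matters to you, the scoring function is the only part that needs to change.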

This sounds like the kind of problem a plagiarism detector is designed for.

----
I Go Back to Sleep, Now.

OGB

Re^2: word similarity measure
by planetscape (Chancellor) on Feb 28, 2009 at 04:28 UTC
    This sounds like the kind of problem a plagiarism detector is designed for.

    If, in fact, that is what the OP is after, s/he may benefit from looking at the nodes mentioned here: Re: Finding plagarized content.


    Update: I rather think, OTOH, that the OP may be looking for something more like Ted Pedersen's SenseClusters.

    HTH,

    planetscape