in reply to Fuzzy text matching... again

This seems like a really interesting problem, and one that I am sure a lot of people are working on in one form or another. It reminds me of semantics / natural language processing, and this is where I worry. This is NASA-level stuff!

Is there any way to simplify your problem? If the strings are generally short, Rata's suggestions could be a good start, but with longer strings the complexity of the comparisons would rapidly increase, for example when comparing the existence and ordering of words / sub-strings.

Again, though, we come back to the same problem: you can compare them all you like, but ultimately you need some 'threshold' or series of conditions which you accept as a match, which, as you already point out, is difficult (compare the 'Aberdeen' example with the 'ePub archive' one!).
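To make the 'threshold' idea concrete, here is a rough sketch in Python. Everything in it is my own assumption rather than anything from the thread: the function names `similarity` and `is_match`, the choice of comparing lower-cased word lists, and especially the default cut-off of 0.8, which is an arbitrary number you would have to tune against your own data.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Score two strings from 0.0 to 1.0 by comparing their lower-cased word lists."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def is_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Accept the pair as a match if the score reaches the (hypothetical) threshold."""
    return similarity(a, b) >= threshold
```

The hard part is exactly what this hides: a single number cannot tell a harmless difference from a meaningful one, so whatever value you pick for `threshold` will let some false matches through and reject some real ones.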

As has already been pointed out, without *understanding* the text, this is nigh impossible... and if you achieved a workable (not even fast) solution, great things await! So I guess I would start with something tangible, like binning / flagging the compared pairs into groups of 'similar differences', i.e. "Interpolated word", "Different word order", "Appended", "Pre-pended" etc.

At least then, if you think of anything clever to distinguish real matches from false ones within a particular class of comparisons, it is easier to slot in the code...
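For what it is worth, the binning idea might look something like the Python sketch below. It is only a sketch under my own assumptions: the function name `classify` is made up, it works on lower-cased, whitespace-split words, it checks the classes in a fixed order, and the extra "Identical" and "Unclassified" bins are my additions to the four classes named above.

```python
def classify(a: str, b: str) -> str:
    """Put a pair of strings into a coarse 'kind of difference' bin."""
    x, y = a.lower().split(), b.lower().split()
    if x == y:
        return "Identical"
    if sorted(x) == sorted(y):
        return "Different word order"
    # Work out which word list is the shorter one.
    short, long_ = (x, y) if len(x) <= len(y) else (y, x)
    if len(short) < len(long_):
        if long_[:len(short)] == short:
            return "Appended"  # extra words only at the end
        if long_[len(long_) - len(short):] == short:
            return "Pre-pended"  # extra words only at the start
        # Are the shorter string's words all present, in order, in the longer one?
        remaining = iter(long_)
        if all(word in remaining for word in short):
            return "Interpolated word"
    return "Unclassified"


# Hypothetical pairs, just to show the bins.
for pair in [("red car", "red car park"), ("red car", "the red car"),
             ("red car", "red sports car"), ("big red car", "red big car")]:
    print(pair, "->", classify(*pair))
```

Each bin can then get its own rule for deciding real versus false matches, and anything landing in "Unclassified" is a hint that another bin is needed.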

This really does sound like an interesting and relevant problem, but I should imagine it will get very complex, very fast, unless you can simplify your criteria beyond "obviously the same" and vice versa!

Just a something something...

Re^2: Fuzzy text matching... again
by Your Mother (Archbishop) on Jan 08, 2010 at 00:41 UTC

    This is NASA-level stuff!

    Well... but so is this.