Re: Duplicate detection (SQL)

You will find the currently unpublished module Algorithm::HowSimilar at Re: Module for comparing text. It leverages Algorithm::Diff, which is just awesome for all sorts of stuff like this. Re: Closest match Display may also be of interest. There is plenty of discussion of the options in the associated threads.

To stop exact dups, use primary keys or unique indexes and let the RDBMS reject them. All you have to do then is catch the exception and handle it however you like.
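As a rough sketch of that idea (in Python with an in-memory SQLite database, purely for illustration; the table `users`, the column `email` and the helper `insert_unique` are made-up names, not anything from the original question), the unique index does the checking and the code only has to catch the resulting error:

```python
import sqlite3

def insert_unique(conn, email):
    """Try to insert a row; return False if the database rejects it as a duplicate."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
        return True
    except sqlite3.IntegrityError:
        # The unique index refused the row; do whatever suits the application here.
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
conn.execute("CREATE UNIQUE INDEX users_email_uq ON users (email)")

first = insert_unique(conn, "a@example.com")   # accepted
second = insert_unique(conn, "a@example.com")  # rejected by the unique index
```

Other databases raise their own error type for a uniqueness violation, but the pattern should be much the same: attempt the insert, catch the duplicate-key error.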

While the problem as you describe it may have a solution of sorts using some of these tools, the real issue is how you avoid iterating over ALL the existing data to make sure the new data is 'similar/not too similar', whatever that means.

Exact matches are easy; similar ones are the hard part. The problem with similarity is that while you can check for it using the approaches above, you need to do it every time against the whole set of data in the DB. In short, it won't scale unless you can pin down what you want more precisely and, preferably, work out a hashing algorithm that distills a structure into a form you can INDEX.
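One possible shape for such a hash, sketched in Python with SQLite: normalise the text first, then hash the normalised form and store it in an indexed column. The normalisation rule here (lowercase, drop punctuation, sort the unique words) is only an assumption about what 'similar' might mean, and `fingerprint`, `add_doc`, `find_similar` and the `docs` table are hypothetical names. Two inputs match only if they normalise to exactly the same thing, so this catches reordering and punctuation noise, not fuzzy similarity in general.

```python
import hashlib
import re
import sqlite3

def fingerprint(text):
    """Hypothetical normaliser: lowercase, drop punctuation, sort unique words, then hash."""
    words = sorted(set(re.findall(r"[a-z0-9]+", text.lower())))
    return hashlib.sha1(" ".join(words).encode("utf-8")).hexdigest()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT, fp TEXT)")
conn.execute("CREATE INDEX docs_fp_idx ON docs (fp)")  # the lookup below uses this index

def add_doc(conn, text):
    """Store the text together with its fingerprint; return the new row id."""
    cur = conn.execute("INSERT INTO docs (body, fp) VALUES (?, ?)",
                       (text, fingerprint(text)))
    return cur.lastrowid

def find_similar(conn, text):
    """Return ids of stored rows whose fingerprint equals that of the new text."""
    rows = conn.execute("SELECT id FROM docs WHERE fp = ?",
                        (fingerprint(text),)).fetchall()
    return [r[0] for r in rows]
```

The point of the design is that the expensive comparison is replaced by an equality test on one indexed column, so the cost no longer grows with the number of rows you would otherwise have to diff against.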

This is really a search problem. It may be that you need to index your data and use the new data as the 'search terms' in some way. If you get a close match (in search terms) then....
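Here is a minimal sketch of the 'new data as search terms' idea, again in Python with SQLite and with invented names (`terms`, `tokens`, `close_matches`). It keeps a simple word index, pulls back only the rows that share at least one word with the new text, and scores just those candidates. The Jaccard score and the 0.6 threshold are assumptions chosen for the example; a diff-based measure could be dropped in at the same spot.

```python
import re
import sqlite3

def tokens(text):
    """Split text into a set of lowercase word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("CREATE TABLE terms (term TEXT, doc_id INTEGER)")
conn.execute("CREATE INDEX terms_term_idx ON terms (term)")

def add_doc(conn, text):
    """Store the text and one index row per distinct word in it."""
    doc_id = conn.execute("INSERT INTO docs (body) VALUES (?)", (text,)).lastrowid
    conn.executemany("INSERT INTO terms (term, doc_id) VALUES (?, ?)",
                     [(t, doc_id) for t in tokens(text)])
    return doc_id

def close_matches(conn, text, threshold=0.6):
    """Use the new text's words as search terms, then score only the candidates."""
    query = tokens(text)
    if not query:
        return []
    marks = ",".join("?" * len(query))
    candidates = conn.execute(
        f"SELECT DISTINCT doc_id FROM terms WHERE term IN ({marks})",
        tuple(query)).fetchall()
    hits = []
    for (doc_id,) in candidates:
        body = conn.execute("SELECT body FROM docs WHERE id = ?",
                            (doc_id,)).fetchone()[0]
        other = tokens(body)
        score = len(query & other) / len(query | other)  # Jaccard similarity
        if score >= threshold:
            hits.append((doc_id, score))
    return sorted(hits, key=lambda h: -h[1])
```

Very common words will drag in a lot of candidates, and SQLite caps the number of `?` placeholders per statement, so a real version would probably drop stop words and batch long queries.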

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
