Re^6: return primary key if duplicate entry exists?

CRC32 should only be used as a checksum to verify the integrity of data, not as a key: there is no guarantee whatsoever that it will be unique for each different input, and the result is only 32 bits (4 bytes) long.
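To get a feel for how quickly 32 bits run out, here is a rough sketch in Python using the standard birthday-bound approximation. It assumes the checksum values behave like uniformly random numbers, which is only an approximation for CRC32, and the function name `collision_probability` is made up for this illustration.

```python
import math

def collision_probability(n_items: int, bits: int) -> float:
    """Birthday-bound estimate: chance that at least two of n_items
    uniformly random `bits`-bit values are equal."""
    space = 2 ** bits
    exponent = n_items * (n_items - 1) / (2 * space)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny
    return -math.expm1(-exponent)

# Estimated collision risk for a 32-bit checksum at various table sizes
for n in (1_000, 77_163, 1_000_000):
    print(f"{n:>9} inputs: {collision_probability(n, 32):.4f}")
```

Under that assumption, roughly 77,000 inputs are already enough for an even chance of at least one collision among 32-bit values.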

What you need is a message digest. Have a look at Digest::SHA1. The digest function returns a 20-byte binary or 40-character hexadecimal result. That still isn't guaranteed to be unique for each different input, but because the result is now 160 bits long, the risk of a collision (i.e. the same digest value for a different input) is much smaller. In any case, if two DNA sequences have different digest values, they are guaranteed to be different. If two sequences have the same digest value, they can still be different (this is called a "collision"), so you should compare the full DNA sequences to find out whether they really are the same.
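As a small illustration of the digest sizes and of the "compare digests first, then the full sequence" rule, here is a sketch in Python, where `hashlib.sha1` plays the role of Digest::SHA1. The helper name `same_sequence` and the sample sequences are invented for the example.

```python
import hashlib

def same_sequence(a: str, b: str) -> bool:
    """Decide whether two sequences are identical, using the digest as a quick filter."""
    if hashlib.sha1(a.encode("ascii")).digest() != hashlib.sha1(b.encode("ascii")).digest():
        return False   # different digests: the sequences certainly differ
    return a == b      # equal digests: confirm against the full sequences

seq = "ACGTACGTTAGC"
binary = hashlib.sha1(seq.encode("ascii")).digest()     # raw bytes
hexed = hashlib.sha1(seq.encode("ascii")).hexdigest()   # hexadecimal text
print(len(binary), len(hexed))
print(same_sequence(seq, "ACGTACGTTAGC"), same_sequence(seq, "ACGTACGTTAGG"))
```

In a real application the digests would of course be computed once and stored, not recomputed for every comparison.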

But since even the SHA1 digest does not guarantee "uniqueness", it cannot be used as a key in your database. Instead, use an auto-incrementing primary key and store both the full DNA sequence and its digest in the database. The digest can serve as an index to quickly check whether a DNA sequence is new or already stored. If you find a matching digest value, you must still compare the full DNA sequence to make sure it is not a (rare) collision.
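The scheme above could look roughly like the following sketch. It uses Python with an in-memory SQLite database purely as a stand-in for whatever database you actually run; the table name `dna`, its columns, the index name and the helpers `open_store` and `find_or_insert` are all assumptions made for the example.

```python
import hashlib
import sqlite3

def open_store(path=":memory:"):
    """Create the (hypothetical) table: auto-increment key, digest, full sequence."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dna ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " digest TEXT NOT NULL,"
        " sequence TEXT NOT NULL)"
    )
    # Plain (non-unique) index: two different sequences may share a digest
    conn.execute("CREATE INDEX IF NOT EXISTS dna_digest_idx ON dna (digest)")
    return conn

def find_or_insert(conn, sequence):
    """Return the primary key of `sequence`, inserting it first if it is new."""
    digest = hashlib.sha1(sequence.encode("ascii")).hexdigest()
    rows = conn.execute("SELECT id, sequence FROM dna WHERE digest = ?", (digest,))
    for row_id, stored in rows:
        if stored == sequence:   # full comparison rules out a collision
            return row_id
    cur = conn.execute(
        "INSERT INTO dna (digest, sequence) VALUES (?, ?)", (digest, sequence)
    )
    conn.commit()
    return cur.lastrowid

conn = open_store()
first = find_or_insert(conn, "ACGTACGT")
again = find_or_insert(conn, "ACGTACGT")   # duplicate: existing key comes back
other = find_or_insert(conn, "TTGACCAA")
print(first, again, other)
```

The index on the digest is deliberately not declared unique, so a genuine collision would simply end up as a second row with the same digest and a different sequence.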

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

My blog: Imperial Deltronics
