I'm storing email bodies in a database. Because I'm a big fan of deduplication, I'd like to ask: how can I get a numeric representation of a UTF-8 encoded string (an email body)? The goal is to search for a similar, already stored body and save only a diff against it instead of re-storing the entire email body, which seems wasteful. Thanks in advance.
Update: I know I can manually run a Levenshtein comparison against all previously stored bodies, but that seems to defeat the purpose, since it means comparing every new email against every stored one, which would be very wasteful.
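For reference, the brute-force approach I mean would look roughly like the sketch below. It is only an illustration: `closest_body` and the in-memory `stored_bodies` list are hypothetical stand-ins for whatever the real database lookup would be.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, O(len(a) * len(b)) time."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def closest_body(new_body, stored_bodies):
    """Return (index, distance) of the nearest stored body, or None if empty.

    Hypothetical helper: it has to scan every stored body.
    """
    best = None
    for index, body in enumerate(stored_bodies):
        distance = levenshtein(new_body, body)
        if best is None or distance < best[1]:
            best = (index, distance)
    return best
```

Every new email triggers one full edit-distance computation per stored body, so the cost grows with both the number of stored emails and their length. That is the part I'd like to avoid.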
Hmmm... maybe convert the UTF-8 bytes to hex and then to decimal? I don't know.
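To make the hex-to-decimal idea concrete, here is a minimal sketch of what I have in mind. `body_to_int` is just a name I made up for illustration.

```python
def body_to_int(body: str) -> int:
    """Encode the body as UTF-8, view the bytes as hex, and parse as an integer."""
    raw = body.encode("utf-8")
    hex_digits = raw.hex()
    # An empty body has no hex digits, so treat it as zero.
    return int(hex_digits, 16) if hex_digits else 0


if __name__ == "__main__":
    print(body_to_int("hello"))
    print(body_to_int("hello!"))
```

The resulting integer is as large as the body itself (roughly 8 bits per byte), so it is really just the same data in another form. I'm not sure whether two numbers being close says anything useful about the two bodies being similar.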