Whether you have your million records in memory (fast) or on disk in a database (slow), you have to take the time to insert your new data. Looking up existing data is different: as explained, looking up in a hash is O(1). You take the key, perform a calculation on it (which depends on the length of the key, not the size of the hash), and go straight to that entry in the (associative) array. A database lookup can't be any faster than O(1), and it can be as bad as O(log N) (I can't imagine any database doing an index lookup slower than a binary search), which depends on the number of data points you're comparing against.
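To make that concrete, here's a minimal sketch of hash-based dedup; reading keys from STDIN is just a placeholder for wherever your records actually come from:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Keep only the first occurrence of each key.
# Each test/insert on %seen is an O(1) hash operation,
# regardless of how many keys are already stored.
my %seen;
while ( my $key = <STDIN> ) {
    chomp $key;
    next if $seen{$key}++;    # already seen: skip the duplicate
    print "$key\n";           # first occurrence only
}
```

Even with a million keys this stays fast, because each lookup hashes the key once and jumps straight to its bucket; the cost doesn't grow with the size of %seen (barring pathological collisions).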
The only way a database could be faster is if it's running on a big honkin' box with lots of RAM, and that's a different box from the one running your perl client, so the data can sit in memory there when it wouldn't fit locally.
This problem is one of the primary reasons to use a hash. (Not the only one, but one of them nonetheless.)