in reply to De Duping Street Addresses Fuzzily

Honestly, I think a large part of this problem can be solved by using the database as effectively as possible. I am assuming you are using a standard DB such as Oracle or MySQL. The data you want to search can then be sorted with an "ORDER BY" clause on the address number and street name. This gives you a sorted list in which most of the dupes end up next to each other.
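
To make the sorting idea concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module only so that it runs without a database server; the table name `addresses` and the columns `id`, `house_number` and `street_name` are made up for the example, so substitute whatever your schema actually uses. The same ORDER BY clause should carry over to Oracle or MySQL, apart from the SQLite-specific `COLLATE NOCASE`.

```python
import sqlite3


def sorted_addresses(rows):
    """Load (id, house_number, street_name) rows and return them sorted.

    The table and column names are hypothetical; adapt them to your schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE addresses ("
        "id INTEGER PRIMARY KEY, house_number INTEGER, street_name TEXT)"
    )
    conn.executemany("INSERT INTO addresses VALUES (?, ?, ?)", rows)
    # Sort on number, then street name (ignoring case), so that likely
    # duplicates come out as neighbouring rows.
    cursor = conn.execute(
        "SELECT id, house_number, street_name FROM addresses "
        "ORDER BY house_number, street_name COLLATE NOCASE, id"
    )
    result = cursor.fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    sample = [
        (1, 123, "Main St."),
        (2, 45, "Sunset Blvd"),
        (3, 123, "main street"),
        (4, 45, "Sunset Boulevard"),
    ]
    for row in sorted_addresses(sample):
        print(row)
```

Sorting only brings candidates close together. Spellings that differ early in the street name can still land far apart, which is why the normalization step matters.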

As for the rest, you need to match on the address (number + street name) and build a hash, or some other data structure, that stores the accepted spellings of common street-name suffixes and other address conventions (such as st., blvd., etc.). Have the program run each address through this hash and transform it to a common form; duplicate addresses will then produce identical output, but with different primary keys.
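
A rough sketch of that lookup, written in Python with a dict standing in for the hash. The abbreviation table is a tiny made-up sample, and the function names `normalize_address` and `find_duplicates` are just for illustration; a real list of accepted spellings would be much longer.

```python
import re
from collections import defaultdict

# Hypothetical sample of accepted spellings; extend it for your own data.
# Caution: "st" can also mean "saint" (e.g. "St. Louis Ave"), which is one
# of the edge cases this simple lookup does not handle.
SUFFIXES = {
    "st": "street",
    "blvd": "boulevard",
    "ave": "avenue",
    "rd": "road",
}


def normalize_address(address):
    """Lowercase, drop periods and commas, and expand known abbreviations."""
    words = re.sub(r"[.,]", "", address.lower()).split()
    return " ".join(SUFFIXES.get(word, word) for word in words)


def find_duplicates(rows):
    """Group primary keys by normalized address; keep groups of two or more."""
    groups = defaultdict(list)
    for primary_key, address in rows:
        groups[normalize_address(address)].append(primary_key)
    return {addr: keys for addr, keys in groups.items() if len(keys) > 1}


if __name__ == "__main__":
    sample = [
        (1, "123 Main St."),
        (2, "123 main street"),
        (3, "45 Sunset Blvd"),
        (4, "45 Sunset Boulevard"),
        (5, "9 Elm Ave"),
    ]
    print(find_duplicates(sample))
```

Each key in the result is a normalized address, and the list behind it holds the primary keys of the rows that collapsed to it. This only catches exact matches after normalization; misspellings would need some fuzzy comparison on top.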

This is only a general overview of your problem. There will obviously be some edge cases that come up when you tackle it. It is quite a large problem, but I think that if you let the DB do the work for you, it will simplify things tremendously.