Re: Anonymising data

I don't think you want to derive fake data from real, sensitive data. Depending on the transformation, the output could be brute-forced back to the original values.

What you probably want instead is a list of possible values for each field, from which you combine values at random to generate complete entries.
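
Here is a rough sketch of that approach in Python. The field names, the value pools, and the make_entry helper are all made up for illustration and would need to be replaced with lists that fit your own schema:

```python
import random

# Hypothetical value pools, one list per field (not taken from any real data).
FIELD_VALUES = {
    "name": ["John Smith", "Cher", "k d lang"],
    "city": ["Springfield", "Llanfairpwllgwyngyll", ""],
    "age": [0, 17, 18, 65, 120],
}

def make_entry(rng):
    """Build one fake record by picking each field's value independently."""
    return {field: rng.choice(values) for field, values in FIELD_VALUES.items()}

rng = random.Random(42)  # fixed seed so a failing test run can be reproduced
entries = [make_entry(rng) for _ in range(5)]
for entry in entries:
    print(entry)
```

Because no real record goes in, there is nothing to reverse-engineer out of the result. The fixed seed is only a convenience for repeatable test runs.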

Since you mention this is for test data, you should think about edge cases for each field, so that your lists are broad and your testing more robust.

For instance, a name field could be one of these:

John Smith
Cher
k d lang
Mr. William Peterson III, Ph.D., M.D., J.D.
The Mamas and the Papas
Mssr. Jacque Blacque du Laurier, Esquire
Hans-Peter van Scoter
TAFKAP [The Artist Formerly Known As Prince]
8) [frog smiley]
Steve & Sherry Smith
Steve Smith & Sherry Shortcake [married, preserving surname]
Tenchi Kanaka-san

(Don't forget Unicode, various Asian forms, and Celtic Rune forms.) If you really plan to test something, you should consider the boundaries of your input filter and attack them appropriately. (Of course, if you allow data like this, parsing it into first name, last name, and titles will be daunting.)
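
To make "attack the boundaries" concrete, here is one possible sketch in Python. The accepts_name filter and its 64-character limit are assumptions invented for the example, standing in for whatever filter you actually have:

```python
MAX_LEN = 64  # assumed limit for illustration only

def accepts_name(name):
    """Hypothetical filter: 1 to MAX_LEN characters, all of them printable."""
    return 1 <= len(name) <= MAX_LEN and all(ch.isprintable() for ch in name)

# Each case maps an input to the verdict we expect from the filter.
boundary_cases = {
    "": False,                    # empty input
    "a": True,                    # shortest valid input
    "a" * MAX_LEN: True,          # exactly at the limit
    "a" * (MAX_LEN + 1): False,   # one character past the limit
    "Zoë 山田太郎": True,          # non-ASCII letters
    "Bobby\x00Tables": False,     # embedded NUL character
}

# Collect every input where the filter disagrees with the expectation.
failures = [name for name, expected in boundary_cases.items()
            if accepts_name(name) != expected]
print(failures)
```

The point is to sit right on each edge (empty, at the limit, one past it) rather than only feeding in comfortable middle-of-the-road values.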

-QM
--
Quantum Mechanics: The dreams stuff is made of
