Anonymising data

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Anonymising data by Zaxo (Archbishop) on Oct 20, 2005 at 15:44 UTC
Look into the Crypt:: namespace. Don't use any "looks good" scheme you just dreamed up. With the modules, real encryption or digesting is easier to use than anything you could implement on the spot. After Compline, Zaxo
Re^2: Anonymising data by sauoq (Abbot) on Oct 20, 2005 at 16:41 UTC
I'm guessing here, but I get the feeling he isn't so much trying to encrypt anything as to take a list of names (and addresses) and munge them into a list of fictional names (and addresses) for use as examples or whatnot. Like, maybe turning something like this...

    Jonathan Walker
    12 Cross St.
    Hazard, KY 41701

    James Beam
    81 Donut Circle
    What Cheer, IA 50268

    John Daniels
    1 Lonely Dr.
    Solitude, IN 47620

into something like this...

    James Daniels
    81 Donut Dr
    Solitude, IN 47620

    Jonathon Beam
    12 Lonely Circle
    Hazard, KY 41701

    John Walker
    1 Cross St.
    What Cheer, IA 50268

It's hard to say from his post, but that's how I read it. I wouldn't even try without seeing some sample data though. -sauoq "My two cents aren't worth a dime.";
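The kind of munging described above can be sketched as an independent shuffle of each field across all records. This is only an illustration: the field names and the `munge_records` helper are made up here, and it assumes the records have already been parsed into fields. It uses `shuffle` from the core List::Util module.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Sample records, already split into fields (field names are hypothetical).
my @records = (
    { first => 'Jonathan', last => 'Walker',  number => 12, street => 'Cross',
      suffix => 'St.',    city => 'Hazard, KY 41701' },
    { first => 'James',    last => 'Beam',    number => 81, street => 'Donut',
      suffix => 'Circle', city => 'What Cheer, IA 50268' },
    { first => 'John',     last => 'Daniels', number => 1,  street => 'Lonely',
      suffix => 'Dr.',    city => 'Solitude, IN 47620' },
);

# Shuffle each column on its own, then reassemble rows from the shuffled columns.
sub munge_records {
    my @in = @_;
    return () unless @in;
    my @out;
    for my $field (sort keys %{ $in[0] }) {
        my @column = shuffle map { $_->{$field} } @in;
        $out[$_]{$field} = $column[$_] for 0 .. $#in;
    }
    return @out;
}

for my $r (munge_records(@records)) {
    print "$r->{first} $r->{last}\n";
    print "$r->{number} $r->{street} $r->{suffix}\n";
    print "$r->{city}\n\n";
}
```

Every output value still comes from the real data, and with only a few records a shuffled row can come out identical to an original one, so this is probably only suitable where the individual values are not sensitive on their own.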
Re^3: Anonymising data by Anonymous Monk on Oct 21, 2005 at 09:08 UTC
Bingo, this is exactly what I'm trying to do (Data Protection Act and all that); sorry for the lack of clarity there. There is no requirement to un-munge the data - in fact, if it can be unmunged, it's bad. I thought this exact style of munging might be in a module already, as it seems a standard thing you would want to do when dealing with sensitive test data. I'll write a small module and post it for comments. Cheers Kevin
Re: Anonymising data by Limbic~Region (Chancellor) on Oct 20, 2005 at 16:06 UTC
Anonymous Monk, you are going to need to clarify your requirements. Your use of "readable" implies to me that you want the output to use letters, numbers, and punctuation. Your use of "undecodable" implies that you do not want the process to be reversible. This sounds like a hashing algorithm (a one-way function), but I am not too sure. You could just as easily mean that you want the process to be reversible, but not without the secret or a whole lot of detective work. Please clarify. Cheers - L~R
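For the one-way reading, a minimal sketch using the core Digest::SHA module might look like the following. The `pseudonym` helper, the whitespace and case normalisation, the truncation to 16 hex digits, and the secret value are all assumptions made for illustration. A keyed digest (HMAC) is used rather than a bare hash because names come from a small space and an unkeyed digest can be guessed by hashing a dictionary of likely names.

```perl
use strict;
use warnings;
use Digest::SHA qw(hmac_sha256_hex);

# Hypothetical secret; for non-reversible test data it could be generated
# randomly for one run and then thrown away.
my $secret = 'replace-with-random-bytes';

# Map a value to a short, readable, one-way token.
sub pseudonym {
    my ($value, $key) = @_;
    # Normalise case and whitespace so trivially different spellings of the
    # same value map to the same token.
    (my $norm = lc $value) =~ s/\s+/ /g;
    return substr(hmac_sha256_hex($norm, $key), 0, 16);
}

print pseudonym('Jonathan Walker', $secret), "\n";
```

The same input always yields the same token under a given key, which keeps joins between tables intact, but the output is hex rather than a fictional name.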
Re: Anonymising data by Moron (Curate) on Oct 20, 2005 at 16:21 UTC
Crypt::Enigma encrypts into a character set found on an ordinary typewriter. But of course Enigma was broken during World War II, and a modern computer makes short work of it, which doesn't bode well for security. On the other hand, you could pick one of the block ciphers from the same namespace, say with a 128- or 256-bit key, and recode each nybble of the output into hexadecimal (0-9, A-F) for readability, using the unpack and pack functions to make the readable/unreadable transition: hex-encode after encrypting, and hex-decode before decrypting. -M Free your mind
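The hexadecimal recoding step can be sketched with the built-in `pack` and `unpack` alone. The byte string below is only a stand-in for whatever a block cipher would produce; no cipher is actually run here, and the helper names `to_hex` and `from_hex` are made up for the example.

```perl
use strict;
use warnings;

# Stand-in for ciphertext: arbitrary bytes, some of them unprintable.
my $ciphertext = "\x00\x1f\x7f\xa5\xff";

# Two hex digits per byte, one digit per nybble.
sub to_hex   { my ($bytes) = @_; return uc(unpack('H*', $bytes)) }

# Inverse transition: hex digits back to raw bytes, ready for decryption.
sub from_hex { my ($hex) = @_; return pack('H*', $hex) }

my $readable = to_hex($ciphertext);
print "$readable\n";    # prints 001F7FA5FF
print "round trip ok\n" if from_hex($readable) eq $ciphertext;
```

The readable form is twice the length of the raw bytes, and it stays reversible by anyone holding the key, so this suits the "reversible with a secret" interpretation only.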
Re: Anonymising data by pboin (Deacon) on Oct 20, 2005 at 20:16 UTC
Apparently, there's some confusion as to whether you want this output to be reversible or not. I have a hunch you just want some 'trash' data to muck around with. If so, take a peek at Data::Random. I've used it with success to make lots of test data.
Re: Anonymising data by jgallagher (Pilgrim) on Oct 20, 2005 at 17:58 UTC
I think the big question is whether or not you want to be able to go backwards, that is, do some processing on the "encrypted" names and addresses and then remap them to the originals. If you want to do that, follow some of the encryption modules other monks pointed you towards. If you don't, is there a reason you don't want to just create a random set of data? I.e., take a list of first and last names, street addresses, towns, etc., and just create a list of random groupings of them?
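A rough sketch of the random-groupings idea, using only built-ins, could look like this. The value lists, the `pick` and `random_record` helpers, and the record layout are all invented for illustration; none of the values come from real data.

```perl
use strict;
use warnings;

# Invented pools of values, one list per field.
my %pool = (
    first  => [qw(Alice Bob Carol Dave)],
    last   => [qw(Archer Baker Carter Draper)],
    street => ['Elm St.', 'Oak Ave.', 'Mill Lane'],
    town   => ['Springfield', 'Riverton', 'Lakeside'],
);

# Return one random element of an array reference.
sub pick {
    my ($list) = @_;
    return $list->[ int(rand(@$list)) ];
}

# Build one fictional record by picking each field independently.
sub random_record {
    my ($p) = @_;
    return {
        name    => join(' ', pick($p->{first}), pick($p->{last})),
        address => join(' ', 1 + int(rand(999)), pick($p->{street})),
        town    => pick($p->{town}),
    };
}

for (1 .. 3) {
    my $r = random_record(\%pool);
    print "$r->{name}\n$r->{address}\n$r->{town}\n\n";
}
```

Because nothing here is derived from the sensitive records, there is nothing to un-munge. The built-in `rand` is not cryptographically strong, which should not matter for throwaway test data.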
Re: Anonymising data by QM (Parson) on Oct 21, 2005 at 13:41 UTC
I don't think you want to use real, sensitive data to generate fake data. That could be brute forced back into the original. What you probably want instead is a list of possible values for each field, and combine values randomly to generate complete entries.

Since you mention this is for test data, you should think about edge cases for each field, so that your lists are broad and your testing more robust. For instance, a name field could be one of these:

    John Smith
    Cher
    k d lang
    Mr. William Peterson III, Ph.D., M.D., J.D.
    The Mamas and the Papas
    Mssr. Jacque Blacque du Laurier, Esquire
    Hans-Peter van Scoter
    TAFKAP [The Artist Formerly Known As Prince]
    8) [frog smiley]
    Steve & Sherry Smith
    Steve Smith & Sherry Shortcake [married, preserving surname]
    Tenchi Kanaka-san

(Don't forget Unicode, various Asian forms, and Celtic Rune forms.)

If you really plan to test something, you should consider the boundaries of your input filter, and attack them appropriately. (Of course, if you allow data like this, parsing it into first name, last name, and titles will be daunting.)

-QM -- Quantum Mechanics: The dreams stuff is made of