Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

Are there any Perl modules to handle anonymising a list of names and addresses into readable yet undecodable values?

Or maybe just a huge shuffle of the lot...?

Or maybe just an evil train of tr/// statements?

I thought of just Lingua::Bork'ing the lot, but obviously it's too close to the real values, and the obligatory bork bork bork is going to fill out the fields.


But you get my drift.... votive candle at the ready.
Kevin

Replies are listed 'Best First'.
Re: Anonymising data
by Zaxo (Archbishop) on Oct 20, 2005 at 15:44 UTC

    Look into the Crypt namespace. Don't use any "looks good" thing you just dreamed up. With the modules you can use real encryption or digesting more easily than you can implement something you make up on the spot.

    After Compline,
    Zaxo

      I'm guessing here but I get the feeling he isn't so much trying to encrypt anything as to just take a list of names (and addresses) and munge them into a list of fictional names (and addresses) for use as examples or what not.

      Like, maybe turning something like this...

      Jonathan Walker
      12 Cross St.
      Hazard, KY 41701

      James Beam
      81 Donut Circle
      What Cheer, IA 50268

      John Daniels
      1 Lonely Dr.
      Solitude, IN 47620

      into something like this...

      James Daniels
      81 Donut Dr.
      Solitude, IN 47620

      Jonathan Beam
      12 Lonely Circle
      Hazard, KY 41701

      John Walker
      1 Cross St.
      What Cheer, IA 50268

      It's hard to say from his post, but that's how I read it. I wouldn't even try without seeing some sample data though.
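
      A minimal sketch of that kind of munging, assuming the records have already been split into fields (the field names and the shuffle_columns helper are made up for illustration). Each column is shuffled independently with List::Util. Note that this only permutes the real values, so every original name and street still appears somewhere in the output, and with few records a value can land back in its own row.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical records, already parsed into fields.
my @records = (
    { first => 'Jonathan', last => 'Walker',  street => '12 Cross St.',    city => 'Hazard, KY 41701' },
    { first => 'James',    last => 'Beam',    street => '81 Donut Circle', city => 'What Cheer, IA 50268' },
    { first => 'John',     last => 'Daniels', street => '1 Lonely Dr.',    city => 'Solitude, IN 47620' },
);

# Shuffle every field independently across all records, so that no
# output row is guaranteed to keep the field combination it started with.
sub shuffle_columns {
    my @rows = @_;
    return () unless @rows;
    my @out = map { +{} } @rows;
    for my $field (sort keys %{ $rows[0] }) {
        my @values = shuffle map { $_->{$field} } @rows;
        $out[$_]{$field} = $values[$_] for 0 .. $#rows;
    }
    return @out;
}

for my $r (shuffle_columns(@records)) {
    print "$r->{first} $r->{last}\n$r->{street}\n$r->{city}\n\n";
}
```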

      -sauoq
      "My two cents aren't worth a dime.";
      
        Bingo,

        This is exactly what I'm trying to do, Data Protection Act and all that; sorry for the lack of clarity there. There is no requirement to un-munge the data. In fact, if it can be un-munged, it's bad.

        I thought this exact style of munging might be in a module already as it seems a standard thing you would want to do when dealing with sensitive test data.

        I'll write a small module and post for comments.

        Cheers

        Kevin

Re: Anonymising data
by Limbic~Region (Chancellor) on Oct 20, 2005 at 16:06 UTC
    Anonymous Monk,
    You are going to need to clarify your requirements. Your use of "readable" implies to me that you want to use letters, numbers, and punctuation. Your use of "undecodable" implies that you do not want to make the process reversible.

    This sounds like a hashing algorithm (one-way encryption), but I am not too sure. You could just as easily mean that you want the process to be reversible, but not without the secret or a whole lot of detective work.
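
    For the one-way reading, here is a rough sketch using the core Digest::SHA module. The pseudonym helper, the secret, and the 12-character truncation are all assumptions made for illustration. A keyed digest (HMAC) is used rather than a bare hash because names come from a small, guessable set, so an unkeyed hash could be reversed with a dictionary. The output is printable hex, not a plausible-looking name, so it only fits if "readable" means "printable".

```perl
use strict;
use warnings;
use Digest::SHA qw(hmac_sha256_hex);

# Hypothetical secret; it must not be shipped with the test data.
my $secret = 'replace-with-a-long-random-secret';

# Map a value to a stable token. The same input always gives the same
# token, but the token cannot be turned back into the input.
sub pseudonym {
    my ($value, $key) = @_;
    # Normalise case and whitespace so trivially different spellings match.
    (my $norm = lc $value) =~ s/\s+/ /g;
    $norm =~ s/^ | $//g;
    # Truncated for brevity; shorter tokens raise the chance of collisions.
    return substr(hmac_sha256_hex($norm, $key), 0, 12);
}

print pseudonym('Jonathan Walker', $secret), "\n";
```

    Because the mapping is stable, records that shared a name before still share a token afterwards, which keeps joins and duplicate checks working on the munged data.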

    Please clarify.

    Cheers - L~R

Re: Anonymising data
by Moron (Curate) on Oct 20, 2005 at 16:21 UTC
    Crypt::Enigma encrypts into the character set found on an ordinary typewriter. But of course Enigma was broken during World War II, which doesn't bode well for security.

    On the other hand, you could pick one of the block ciphers from the same namespace, say 128 or 256 bit, and recode each nybble of the output as hexadecimal (0-9, A-F) for readability. Use unpack to turn the ciphertext into hex after encrypting, and pack to turn the hex back into bytes before decrypting.
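
    The readable/unreadable transition can be sketched with nothing but pack and unpack. The byte string below is only a stand-in for real ciphertext, and to_hex and from_hex are names invented for this example; the encryption step itself is left out.

```perl
use strict;
use warnings;

# Stand-in for ciphertext: any string of bytes behaves the same way.
my $bytes = pack 'C*', 0x00, 0x7f, 0x80, 0xff, 0x41;

# Two hex digits per byte, upper-cased to match the 0-9 A-F alphabet.
sub to_hex { my ($raw) = @_; return uc(unpack('H*', $raw)) }

# Inverse transform; pack 'H*' accepts upper- and lower-case digits.
sub from_hex { my ($hex) = @_; return pack('H*', $hex) }

my $hex = to_hex($bytes);
print "$hex\n";    # prints 007F80FF41
print "round trip ok\n" if from_hex($hex) eq $bytes;
```

    Hex doubles the length of the data, so fixed-width fields need to be sized accordingly.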

    -M

    Free your mind

Re: Anonymising data
by pboin (Deacon) on Oct 20, 2005 at 20:16 UTC

    Apparently, there's some confusion as to whether you want this output to be reversible or not. I have a hunch you just want some 'trash' data to muck around with. If so, take a peek at Data::Random. I've used it with success to make lots of test data.

Re: Anonymising data
by jgallagher (Pilgrim) on Oct 20, 2005 at 17:58 UTC
    I think the big question is whether or not you want to be able to go backwards, that is, do some processing on the "encrypted" names and addresses and then remap them to the originals. If you want to do that, follow some of the encryption modules other monks pointed you towards. If you don't, is there a reason you don't want to just create a random set of data? I.e., take a list of first and last names, street addresses, towns, etc., and just create a list of random groupings of them?

Re: Anonymising data
by QM (Parson) on Oct 21, 2005 at 13:41 UTC
    I don't think you want to use real, sensitive data to generate fake data. That could be brute-forced back into the original.

    What you probably want instead is a list of possible values for each field, and combine values randomly to generate complete entries.
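
    A minimal sketch of that approach, using only core Perl. The value pools, the pick and fake_record helpers, and the 1 to 999 house-number range are placeholders made up for illustration; in practice the pools would be much larger and would not be derived from the sensitive data.

```perl
use strict;
use warnings;

# Hypothetical pools of values, one per field.
my %pool = (
    first  => [qw(Alice Bob Carol Dave)],
    last   => [qw(Smith Jones Brown Taylor)],
    street => ['Cross St.', 'Donut Circle', 'Lonely Dr.'],
    city   => ['Hazard, KY 41701', 'What Cheer, IA 50268', 'Solitude, IN 47620'],
);

# Pick one element of an array reference uniformly at random.
sub pick { my ($list) = @_; return $list->[ rand @$list ] }

# Build one complete entry from independently chosen field values.
sub fake_record {
    return {
        name   => pick($pool{first}) . ' ' . pick($pool{last}),
        street => (1 + int rand 999) . ' ' . pick($pool{street}),
        city   => pick($pool{city}),
    };
}

for (1 .. 3) {
    my $r = fake_record();
    print join("\n", @{$r}{qw(name street city)}), "\n\n";
}
```

    Calling srand with a fixed seed before generating makes the output repeatable, which helps when a failing test needs to be reproduced.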

    Since you mention this is for test data, you should think about edge cases for each field, so that your lists are broad and your testing more robust.

    For instance, a name field could be one of these:


    John Smith
    Cher
    k d lang
    Mr. William Peterson III, Ph.D., M.D., J.D.
    The Mamas and the Papas
    Mssr. Jacque Blacque du Laurier, Esquire
    Hans-Peter van Scoter
    TAFKAP [The Artist Formerly Known As Prince]
    8) [frog smiley]
    Steve & Sherry Smith
    Steve Smith & Sherry Shortcake [married, preserving surname]
    Tenchi Kanaka-san

    (Don't forget Unicode, various Asian forms, and Celtic Rune forms.) If you really plan to test something, you should consider the boundaries of your input filter and attack them appropriately. (Of course, if you allow data like this, parsing it into first name, last name, and titles will be daunting.)

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of