Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi there, I have been tearing my hair out trying to find some information on tools or techniques for masking production data for use in system testing. Can anyone help me by posting/e-mailing scripts, snippets or suggestions? It would help me get a big ugly monkey otherwise known as my boss off my back for a while, and I'd be eternally grateful :)

Replies are listed 'Best First'.
Re: Test data masking tools/techniques?
by dws (Chancellor) on Apr 01, 2003 at 02:50 UTC
    There are two approaches you might try:

    First, assuming you have mutual non-disclosure agreements with your customers, find one that will let you use a chunk of their data. Explain that you want to do performance tuning on representative data, and that if you use their data, they'll see the biggest benefit.

    Failing that, pick some subset of their data, and obscure the strings. The last time I did something like this, we calculated MD5 hashes of their strings, then used substrings that were the same length as the original. It made for messy, unreadable data, but was good enough for our (performance tuning) purposes.


    Update: I forgot to mention. When you use MD5 hashes (or substrings), foreign key integrity is maintained. That is, if you've got customer data with textual keys, and references to those keys from other tables, when you go the MD5 route you won't lose those references.

Re: Test data masking tools/techniques?
by graff (Chancellor) on Apr 01, 2003 at 04:15 UTC
    The trick mentioned earlier about using md5 hashes and cutting them up into suitable lengths seems like a nifty idea, if a bit heavy-weight.

    If it would be sufficient to replace alphabetics and digits with randomly selected other alphabetics and digits (to mask personal-id info like names and credit-card numbers, but not disrupt the actual character class relations), something like this might do (not tested):

    sub mask_it { my ($instr) = @_; my $retstr = ''; while ($instr =~ /[0-9a-z]/i) { if (/^([^0-9a-z]*)([a-z]+)/i) { $retstr .= $1; #pass non-alphas as-is my $orig = $2; #replace alphas with new ones $retstr .= join('',map { chr(65+int(rand(26))) } split(//, +$orig)); } elsif (/^([^0-9a-z]*)(\d+)/i { $retstr .= $1; #pass non-digits as-is my $orig = $2; #replace digits with new ones $retstr .= join('',map { chr(48+int(rand(10))) } split(//, +$orig)); } $instr =~ s/^(\W*)$orig//; #remove from input } $retstr .= $instr; #pass on anything that's left over return $retstr; }
    Note that this only outputs upper-case replacements for any input letters; if you want to be more "flexible", it should be easy to add that in. (There are probably a few ways to optimize this, but this gets the basic idea across.)

    update: added the "join('',...)" around each "map {...}" to make sure the string assignment would work properly. Also changed the "while" condition from /\w/ to /[0-9a-z]/i (and similarly for the "if" condidtions), to make sure that underscores don't throw it into an endless loop.

Re: Test data masking tools/techniques?
by grantm (Parson) on Apr 01, 2003 at 01:09 UTC

    Perhaps we could be more helpful if you explained what you mean by "masking production data". Are you talking about manipulating real live data in some way to create test data which is very similar to real-world data but without the risk of exposing real private data to testers?

      Yes, that's what I'm looking to do. Sorry for the lack of clarity. I actually don't agree with this assignment, as I feel that anything I can find on the 'Net would need to be havily modified to fit the data it will be dealing with here, but my boss has insisted I search for existing tools. I personally feel that techniques are going to be of greater assistance to me :/

        Well, I'm not aware of any tools for doing that and I didn't manage to pull anything up with a Google search (I guess you didn't either). I imagine such tools would be very domain-specific.

        As far as techniques go, one approach to taking a randomish sample is to select every 'nth' row from a table (using % (modulo) and the row number). If you can slurp a dataset into memory, you could use Shuffle to randomise it. For randomising names, you could do a random sample of first names, then a random sample of last names using a different seed or modulo and then paste them together.