Re: COVID-19 data analytics, cryptography, and some things you should know

I'll preface with: I'm not a professional cryptographer. The closest I've gotten is understanding and implementing parts of a blockchain prototype.

So, I'll start with the basics. As has been pointed out, your IP validation is a bit sloppy. However, I'm going to take the opposite view to the previous answers: you don't really care. If you're getting the address from your web server, it's 100% clean. The only possible use for validation is to tell you that your code is bad. Since we already know that IPv4 addresses have dots, whereas IPv6 addresses have colons, we can do this even more simply:

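use Socket qw(inet_pton AF_INET AF_INET6);
use Digest::SHA3 qw(sha3_256_hex);
# IPv4 addresses contain dots, IPv6 addresses contain colons.
my $af = $ip =~ /\./ ? AF_INET : AF_INET6;
# Salt, packed binary address, salt (@packing comes from your original code).
my $bytes = pack("H32 a* H32", $packing[0], inet_pton( $af, $ip ), $packing[7]);
return sha3_256_hex($bytes);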

If it's invalid, inet_pton will likely barf (or hand back undef), but that's fine: it's never going to be invalid, because it's not coming from a user (unlike, say, the browser string).
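To make the inet_pton behaviour concrete, here is a minimal sketch. It assumes the core Socket module, whose inet_pton (as far as I know) returns undef for a string it cannot parse rather than dying. packed_ip is just a name I made up for the example.

```perl
use strict;
use warnings;
use Socket qw(inet_pton AF_INET AF_INET6);

# Choose the address family from the separator, as in the snippet above.
sub packed_ip {
    my ($ip) = @_;
    my $af = $ip =~ /\./ ? AF_INET : AF_INET6;
    return inet_pton($af, $ip);    # undef when the string does not parse
}

for my $ip ('192.0.2.1', '2001:db8::1', 'not-an-ip') {
    my $packed = packed_ip($ip);
    printf "%-12s => %s\n", $ip,
        defined $packed ? length($packed) . ' bytes' : 'rejected';
}
```

Note that an IPv4 address packs to 4 bytes and an IPv6 address to 16, so the input to the hash is 36 or 48 bytes depending on the family.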

Now on to your hashing choices. First off, your salts. You are using two salts - $packing[0] and $packing[7]. I'm not sure why. I've never seen anyone do that before. A single salt suffices. For your purposes, if I'm understanding this, you need to use the same salt for everything, which is unfortunate in that a rainbow table can be trivially produced if the salt is known and the algorithm is known. Your database must not, therefore, have the salt in it. That salt is the equivalent of a secret key, and must be guarded as such. (Thus, don't use the ones you just published here.) It does not go into a public post, a public git repository, or anything of the sort. It also does not get shared among the developers even in a private git repository. It gets stored separately, period.
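One way to keep the salt out of the code and the database is to have the deployment hand it to the process. The following is only a sketch under my own assumptions: IP_HASH_SECRET is a hypothetical environment variable name, load_secret and keyed_hash are made-up helpers, and sha256_hex from the core Digest::SHA module stands in for sha3_256_hex so that the example needs nothing from CPAN.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

# IP_HASH_SECRET is a hypothetical environment variable holding the
# hex-encoded secret; the deployment sets it, it is never committed.
sub load_secret {
    my $hex = $ENV{IP_HASH_SECRET} // '';
    die "IP_HASH_SECRET must be 32 hex digits\n"
        unless $hex =~ /\A[0-9a-fA-F]{32}\z/;
    return pack('H32', $hex);    # 16 raw bytes
}

# sha256_hex is a stand-in for sha3_256_hex here.
sub keyed_hash {
    my ($secret, $data) = @_;
    return sha256_hex($secret . $data);
}
```

Refusing to start when the secret is missing is deliberate: silently hashing with an empty salt would be worse than failing loudly.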

If you could use a different salt for each entry, that would change things, but then you could no longer correlate things based on browser/IP address since they'd result in different hashes.

Next, let's look at SHA-3. According to Wikipedia, on the typical x64 hardware that most of us are probably running, you're looking at about 12.6 cycles per byte for hashing. Your packed input is 48 bytes for an IPv6 address (two 16-byte salts plus the 16-byte address), so that's about 600 cycles to hash it. My system runs at 3.7GHz, so that's over 6 million hashes per second. Per core (I have six, but an attacker would have many more). But IPv6 is a very large space, and combined with your browser string, this sounds like a lot of brute force required. Would a hacker be able to reduce this space? Yes, a lot.
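The back-of-the-envelope arithmetic, spelled out. The inputs are just the figures quoted above, so treat the result as a rough estimate and not a benchmark.

```perl
use strict;
use warnings;

my $cycles_per_byte = 12.6;     # SHA-3 figure quoted above
my $input_bytes     = 48;       # 16-byte salt + 16-byte IPv6 address + 16-byte salt
my $clock_hz        = 3.7e9;    # one 3.7GHz core

my $cycles_per_hash = $cycles_per_byte * $input_bytes;
my $hashes_per_sec  = $clock_hz / $cycles_per_hash;

printf "%.1f cycles per hash, about %.1f million hashes per second per core\n",
    $cycles_per_hash, $hashes_per_sec / 1e6;
```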

First off, if I were looking for a specific person, there's a good chance I can figure out their IP address. And probably also their browser. So, if I get your secret key, I have a single hash to produce to find their entry. However, if I don't have the key, I have to try all possible keys to look them up - though that only takes me about 3 minutes, if my math is working properly (I have 6 cores, 12 if you count hyperthreading). This is probably not a huge barrier. This is also why bitcoin rigs can cycle through so many hashes per second in an attempt to find the next block's solution.

Now, if I didn't know the browser string, I could probably guess it. There's not a lot of entropy here - "Mozilla", "Win32", ... there are only so many of these. Yes, some people can customise their browser strings, but almost no one (statistically speaking) does. So, knowing the IP address but not the secret key or the browser string, I will have to try all the possible secret keys times the number of browser strings I'm likely to encounter. If I really care, I just throw one machine at each browser string, and we're still talking minutes.
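Here is a toy version of that guessing game for the case where the secret is already known. Everything in it is made up for illustration: the secret, the candidate list, the hash_ua helper, and the way the browser string is hashed (sha256_hex from the core Digest::SHA module stands in for SHA-3).

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

# Hypothetical secret and candidate browser strings, for illustration only.
my $secret     = pack('H32', '00112233445566778899aabbccddeeff');
my @candidates = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (X11; Linux x86_64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
);

sub hash_ua { return sha256_hex($secret . $_[0]) }

# The stored hash the attacker wants to reverse.
my $target = hash_ua($candidates[1]);

# One hash per candidate is enough once the secret is known.
my ($found) = grep { hash_ua($_) eq $target } @candidates;
print "recovered: $found\n";
```

Without the secret, the same loop simply sits inside a second loop over candidate secrets, which is where the "keys times browser strings" cost comes from.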

If I don't have the IP address either, now we're just brute-forcing the entire space (though we can still probably restrict the browser string to the likely culprits - we'll not reverse every hash, but we'll be able to use those to narrow everything down). But we still only have to do this once - once we find the secret keys on one hash, we have the secret keys on all hashes, and we can brute force everything else much faster.

What you probably want, then, is something that makes it cost-prohibitive to find the secret key in the first place. Instead of a handful of minutes, you need something where this takes days or weeks or months. When I worked on that blockchain, the solution to this was to switch the hashing to Argon2, and my work PC could only manage about 4 hashes per second (and that used all 4 cores). Now going through all possibilities of secret keys will take ~1.5 million times longer. It'll take your server a bit more time to generate the hash as well, but you only have to create it once per entry, not billions of times. (The hashing difficulty for the argon-based blockchain was very very low compared to, say, bitcoin's.)
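To show where the ~1.5 million figure comes from and what it does to the earlier 3-minute estimate, a quick sketch using only the numbers already quoted in this post. It compares a per-core SHA-3 rate with a whole-machine Argon2 rate, so it is an order-of-magnitude guide at best.

```perl
use strict;
use warnings;

my $sha3_per_sec   = 3.7e9 / (12.6 * 48);   # per-core SHA-3 estimate from earlier
my $argon2_per_sec = 4;                     # whole-machine figure quoted above
my $slowdown       = $sha3_per_sec / $argon2_per_sec;

my $minutes_before = 3;                     # the brute-force estimate from earlier
my $days_after     = $minutes_before * $slowdown / (60 * 24);

printf "slowdown: about %.1f million times\n", $slowdown / 1e6;
printf "a %d-minute search becomes roughly %.0f days\n",
    $minutes_before, $days_after;
```

That puts the same search at several years instead of minutes, which is the kind of gap you are after.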

Other than that, I'm not seeing a lot of problem here. But, like I said, I'm not a professional cryptographer, so there may be something I'm missing as well.



	PerlMonks