COVID-19 data analytics, cryptography, and some things you should know

tachyon-II has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

It's been a decade or so but my love of Perl continues.

The background of the Why? for this question can be found at https://fixcovid19.com/about.html which I recommend you read, particularly if you take any medication. It might just save your life.

The data collection tool to which it applies can be found at https://fixcovid19.com. We are gathering this data because the US, Chinese and Canadian CDCs are not, and when the Italians partially gathered this medication data it showed 73% of all COVID-19 deaths occurred in the ~3% of people taking two specific classes of medication. The Turkish data release a few days later found similar results: 68.8% of deaths occurred in this really small group.

The problem to be solved is the safe, deidentified release of IP addresses and Browser strings to fulfil the requirements of HIPAA, GDPR and CCPA. We simply do not gather any other PII (personally identifiable information) so it is impossible for this to leak. Age range, sex, disease severity and outcome and medications are the other data points.

These 2 identifiers will assist researchers in assessing if the crowdsourced data we are gathering is "gamed" or "believable". We have taken steps to make automated submission difficult, but as we hackers know, virtually nothing is impossible if you really put your mind to it...

So the task to hand is to convert an IP and Browser String into a cryptographically secure hash that can not be reversed or revealed with a rainbow table. IP addresses and Browser strings both exist in a small finite search space.

Given this data will be released publicly and is timestamped, it is trivial for an attacker to correlate a known IP address and Browser string to these hashes. Given the secret packing data, I don't believe this would allow any quantity of computational power to elucidate the packing data and thus create a lookup table, but if that is incorrect it would be good to know now and fix it before our impending first data release.

While I am expert in the field of medicine my knowledge of crypto and how you attack it is less. SHA3 was chosen for its resistance to length extension attacks and the packing data size to give enough random data for the resultant hash to spread evenly across that space. Maybe that's good enough, maybe it can/should/must be done better.

Here is a draft version of those hashing functions. Expert commentary appreciated, particularly from cryptographers.

#!/usr/bin/env perl

package SHA3;

use strict;
use warnings;

use Digest::SHA3 qw(sha3_256_hex);
use Digest::MD5 qw(md5_hex);
use Socket qw( inet_pton AF_INET AF_INET6 );

my @packing = qw( 
    fa13a941b76466850c2558d9ae5d969f
    e71ab0d8bb54c75b37ad23a449050121
    6736564ec6bc9bbc8ba42df565317443
    c3e088a5cf247ec0df971c5cb9ee6eec
    6cf20d548878cdd82b8f207192f58c80
    660a311b8d75d5fb28c73f7e2ec5d25e
    377f92899b81ad7c5e1d08b81ccc8904
    8e1f27dee8ae3374ae5c462adf37bba5
    ccd558ff6b9de48ca22023ead2dbd7a2
    ff228ef28ae8544155323180ba070d1b
);

print SHA3::sha3_ip('1.2.3.4'), "\n";
print SHA3::sha3_ip('1.2.3.5'), "\n";
print SHA3::sha3_ip('2001:0db8:0000:0000:0000:8a2e:0370:7334'), "\n";
print SHA3::sha3_ip('2001:0db8:0000:0000:0000:8a2e:0370:7335'), "\n";
print SHA3::sha3_bs('Mozilla'), "\n";
print SHA3::sha3_bs('Win32'), "\n";

=head2 sha3_ip

    Expects a dot quad or an IPv6 address and returns a 
    SHA3_256_hex string or null string for invalid input

=cut

sub sha3_ip {
    my $ip = shift;

    if ( $ip =~ m/^\d+\.\d+\.\d+\.\d+$/ ) {
        my $bytes = pack("H32 a4 H32", $packing[0], inet_pton( AF_INET, $ip ), $packing[7]);
        my $hash = sha3_256_hex($bytes);
        return $hash;
    }   
    elsif ( $ip =~ /^([0-9a-f]{0,4}:){0,7}([0-9a-f]{0,4})$/i ) { 
        my $bytes = pack("H32 a16 H32", $packing[0], inet_pton( AF_INET6, $ip ), $packing[7]);
        my $hash = sha3_256_hex($bytes);
        return $hash;
    }   

    warn "Invalid IP:$ip\n";
    return ''; 
}

=head2 sha3_bs

    Expects a browser string and returns a 
    SHA3_256_hex string or a null string for invalid input

=cut


sub sha3_bs {
    my $bs = shift;
    unless (length $bs > 4 ) { 
        warn "Insufficient data in browser string $bs";
        return ''; 
    }   
    my $bytes = pack("H32 H32 H32", $packing[1], md5_hex($bs), $packing[8]);
    my $hash = sha3_256_hex($bytes);
    return $hash;
}

Replies are listed 'Best First'.
Re: COVID-19 data analytics, cryptography, and some things you should know by AnomalousMonk (Archbishop) on Apr 05, 2020 at 05:59 UTC
`$ip =~ m/^\d+\.\d+\.\d+\.\d+$/` If you're worried about a valid IPv4 dotted-decimal IP address, take a look at Regexp::Common::net (and at Regexp::Common to see how to invoke said module). And I'm sure there are other, similar CPAN resources available. Give a man a fish: `<%-{-{-{-<`
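A minimal sketch of why the loose pattern quoted above matters, assuming only core modules. It compares that regex with a stricter check that relies on `inet_pton` from `Socket` returning undef for text it cannot parse. This is a swapped-in illustration rather than Regexp::Common::net itself, and the helper names `loose_ipv4` and `strict_ipv4` are hypothetical.

```perl
use strict;
use warnings;
use Socket qw(inet_pton AF_INET);

# The loose pattern from the original post: four dot-separated digit runs.
sub loose_ipv4 {
    my ($ip) = @_;
    return $ip =~ m/^\d+\.\d+\.\d+\.\d+$/ ? 1 : 0;
}

# Stricter check (assumption: inet_pton returns undef when it cannot
# parse the text as an IPv4 address).
sub strict_ipv4 {
    my ($ip) = @_;
    return defined inet_pton(AF_INET, $ip) ? 1 : 0;
}

for my $ip (qw(1.2.3.4 999.999.999.999 1.2.3)) {
    printf "%-16s loose=%d strict=%d\n", $ip, loose_ipv4($ip), strict_ipv4($ip);
}
```

The loose pattern accepts any four digit runs, so an out-of-range address such as 999.999.999.999 passes it and would then reach `pack` as an undefined value.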
Re^2: COVID-19 data analytics, cryptography, and some things you should know by tachyon-II (Chaplain) on Apr 05, 2020 at 11:28 UTC
There are multiple ways to check an IP address and Regexp::Common::net would be one of the least efficient. It's rather missing the point of the task at hand. A far better, far faster, far lighter option would be Data::Validate::IP but, like I said, this is not a request for how to validate an IP. It's a request about hacking crypto. The null return value (that goes into the DB) is 100% about security and 0% about validity. The presented strings come from NGINX so don't really need any validation. They come from the socket directing the traffic...
Re^3: COVID-19 data analytics, cryptography, and some things you should know by 1nickt (Canon) on Apr 05, 2020 at 12:36 UTC
"The presented strings come from NGINX so don't really need any validation." And yet, you have code that identifies the IP type by a regexp, and throws if it is not "valid." What do you think is "far better, far faster, far lighter" about `sub _slow_is_ipv4 { shift if ref $_[0]; my $value = shift; return undef unless defined($value); my (@octets) = $value =~ /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/; return undef unless (@octets == 4); foreach (@octets) { return undef if $_ < 0 || $_ > 255; return undef if $_ =~ /^0\d{1,2}$/; } return join('.', @octets); }` in Data::Validate::IP (which has a dependency on NetAddr::IP) as compared to the code in Regexp::Common::net?
Snarky reply to AnomalousMonk's attempt to offer assistance. Downvoted. (And the reply to bliako.)
The way forward always starts with a minimal test.
Re^4: COVID-19 data analytics, cryptography, and some things you should know by AnomalousMonk (Archbishop) on Apr 05, 2020 at 22:03 UTC
Re: COVID-19 data analytics, cryptography, and some things you should know by Tanktalus (Canon) on Apr 05, 2020 at 18:37 UTC
I'll preface with: I'm not a professional cryptographer. The closest I've gotten is understanding and implementing parts of a blockchain prototype.
So, I'll start with the basics. As has been pointed out, your IP validation is a bit sloppy. However, I'm going to take the opposite view to the previous answers: you don't really care. If you're getting this from your web server, it's 100% clean. The only possible use for validation is to tell you that your code is bad. Since we already know that IPv4 addresses have dots, whereas IPv6 addresses have colons, we can do this even more simply: `my $af = $ip =~ /\./ ? AF_INET : AF_INET6; my $bytes = pack("H32 a* H32", $packing[0], inet_pton( $af, $ip ), $packing[7]); return sha3_256_hex($bytes);` If it's invalid, `inet_pton` will likely barf, but that's fine, it's never going to be invalid because it's not coming from a user (unlike, say, the browser string).
Now on to your hashing choices. First off, your salts. You are using two salts - `$packing[0]` and `$packing[7]`. I'm not sure why. I've never seen anyone do that before. A single salt suffices. For your purposes, if I'm understanding this, you need to use the same salt for everything, which is unfortunate in that a rainbow table can be trivially produced if the salt is known and the algorithm is known. Your database must not, therefore, have the salt in it. That salt is the equivalent of a secret key, and must be guarded as such. (Thus, don't use the ones you just published here.) It does not go into a public post, a public git repository, or anything of the sort. It also does not get shared among the developers even in a private git repository. It gets stored separately, period. If you could use a different salt for each key, that would change things, but then you could no longer correlate things based on browser/IP address since they'd result in different hashes.
Next, let's look at SHA-3. Reading on Wikipedia indicates that on the typical x64 hardware most of us are probably running, you're looking at about 12.6 cycles per byte for hashing. Your packed IP input is 48 bytes for IPv6 (36 for IPv4), so that's about 600 cycles to hash it. My system runs at 3.7GHz, so that's over 6 million hashes per second. Per CPU (I have six, but an attacker would have many more).
But IPv6 is a very large space, and combined with your browser string, this sounds like a lot of brute force required. Would a hacker be able to reduce this space? Yes, a lot. First off, if I were to be looking for a specific person, there's a good chance I can figure out their IP address. And probably also their browser. So, if I get your secret key, I have a single hash to produce to find their record. However, if I don't have the key, I have to try all possible keys to look them up - though that only takes me about 3 minutes, if my math is working properly (I have 6 CPUs, 12 if you count hyperthreading). This is probably not a huge barrier. This is also why bitcoin rigs can cycle through so many hashes per second in an attempt to find the next block's solution.
Now, if I didn't know the browser string, I could probably guess it. There's not a lot of entropy here - "Mozilla", "Win32", ... there are only so many of these. Yes, some people can customise their browser strings, but almost no one (statistically speaking) does. So, knowing the IP address but not the secret key or the browser string, I will have to try all the possible secret keys times the number of browser strings I'm likely to encounter. If I really care, I just throw one machine at each browser string, and we're still talking minutes. If I don't have the IP address either, now we're just brute-forcing the entire space (though we can still probably restrict the browser string to the likely culprits - we won't reverse everything, but we'll be able to use those to narrow everything down).
But we still only have to do this once - once we find the secret keys on one hash, we have the secret keys on all hashes, and we can brute force everything else much faster. What you probably want, then, is something that makes it cost-prohibitive to find the secret key in the first place. Instead of a handful of minutes, you need something where this takes days or weeks or months. When I worked on that blockchain, the solution to this was to switch the hashing to Argon2, and my work PC could only manage about 4 hashings per second (and that used all 4 CPUs). Now going through all possibilities of secret keys will take ~1.5 million times longer. It'll take your server a bit more time to generate the hash as well, but you only have to create it once per entry, not billions of times. (The hashing difficulty for the argon-based blockchain was very very low compared to, say, bitcoin's.)
Other than that, I'm not seeing a lot of problem here. But, like I said, I'm not a professional cryptographer, so there may be something I'm missing as well.
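A rough worked version of the arithmetic in this reply, taking the figures it quotes as assumptions (12.6 cycles per byte, a 48-byte input, a 3.7 GHz clock, 6 cores, and about 4 hashes per second for a deliberately slow hash). It only models enumerating all 2^32 IPv4 addresses with the packing data already known. The cost of guessing 256 bits of unknown, random packing data is not modeled here and would be far larger.

```perl
use strict;
use warnings;

# Figures quoted in the reply (rough, hardware-dependent assumptions).
my $cycles_per_byte = 12.6;     # SHA-3 software speed cited from Wikipedia
my $input_bytes     = 48;       # 16 + 16 + 16 bytes for a packed IPv6 input
my $clock_hz        = 3.7e9;    # one 3.7 GHz core
my $cores           = 6;

# Hashes per second on a single core.
my $hashes_per_sec = $clock_hz / ($cycles_per_byte * $input_bytes);

# Time to hash every IPv4 address once, assuming the packing data is known.
my $ipv4_space = 2**32;
my $secs_fast  = $ipv4_space / ($hashes_per_sec * $cores);

# Same enumeration with a slow hash at ~4 hashes per second for the machine.
my $slow_rate = 4;
my $secs_slow = $ipv4_space / $slow_rate;

printf "fast hash: %.1f million hashes/s per core\n", $hashes_per_sec / 1e6;
printf "all IPv4, fast hash, %d cores: %.1f minutes\n", $cores, $secs_fast / 60;
printf "all IPv4, slow hash: %.0f years\n", $secs_slow / (365.25 * 86400);
```

Under these assumptions the whole IPv4 space falls in minutes with a fast hash and in decades with the slow one, which is the gap the reply is pointing at.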
Re: COVID-19 data analytics, cryptography, and some things you should know by papaof5 (Initiate) on Apr 05, 2020 at 16:41 UTC
It's been a while since I wrote any serious perl, but I have some spreadsheets on the COVID-19 statistics which I update several times/day. And would rather automate that. So am looking at Net::Google::Spreadsheets::V4 in CPAN. Here is the link to the shared google sheets where I show these sheets: Shared (RO) folder on Google Drive So I will report back with questions as I attempt this. I am 79 1/2, and maybe should have learned Python 10 years ago - but perl is so much more comfortable. I was so pleased to find Perlmonks alive and well! Keep it up, friends. Boyd
Re: COVID-19 data analytics, cryptography, and some things you should know by bliako (Abbot) on Apr 05, 2020 at 08:23 UTC
Browser strings can be mimicked and faked to one's heart's content. It's up to the server to make a thorough check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Browser_detection_using_the_user_agent
fwiw, i can not visit the url you cited. Firefox complains about an insecure page (`SSL_ERROR_BAD_CERT_DOMAIN`). I tried to view the certificate: it is issued to my ISP!! When I "take the risk" my ISP blocks the page as "not safe". In order to whitelist it i need to login and update preferences, which does not work. Come to think of it, it could be an incompetent attempt by some incompetent banana republic operator to incompetently execute a man-in-the-middle. Which suggests to me that you have huge man-in-the-middle or isp-in-the-middle issues to deal with as well - what if they fake the data a patient sends?
Re^2: COVID-19 data analytics, cryptography, and some things you should know by tachyon-II (Chaplain) on Apr 05, 2020 at 11:19 UTC
Hello bliako, Yes I'm aware of how to fake a browser string. Been doing it for years. If you check your server logs you may well see some amusing messages I left in the string... The server itself runs a Let's Encrypt certificate but out in front of it is some Cloudflare proxy infrastructure. Last time I looked they were not banana republic operators given they proxy for 11.6% of the top 10 million websites on the Internet. They are the man in the middle. For me, they issue a perfectly valid certificate. What country are you in? I'll VPN in and see if I can reproduce the issue. Given who is doing the proxying it's possible the MITM issue lies with you, not us. Just a thought...
Re^3: COVID-19 data analytics, cryptography, and some things you should know by bliako (Abbot) on Apr 05, 2020 at 11:34 UTC
i did not say the problem is with you. The problem is with my provider and I found it really weird that they presented me with a certificate issued to them. (perhaps that's how it works!) Sure you are aware that a browser string can be faked/changed. But how are you going to sanitise it so that you use it for checking uniqueness together with the IP, which in itself is not unique, i.e. a given hospital may have the same IP for all personnel trying to report something to you.
Re: COVID-19 data analytics, cryptography, and some things you should know by Anonymous Monk on Apr 05, 2020 at 18:18 UTC
"So the task to hand is to convert an IP and Browser String into a cryptographically secure hash that can not be reversed or revealed with a rainbow table."
The problem with the IPv4 address space is that it's too damn small. I don't know about SHA-3, but there have been examples of people just going through all 2³² possible addresses, concatenating them with a site-specific secret and computing SHA-1 hashes of them, effectively reversing the hashing process. It wouldn't have been much slower if each IP address was salted with its own nonce, either. Use of much more computationally complex password hashing functions, such as bcrypt, PBKDF2, scrypt or Argon2, would slow down such attacks tremendously. It might also be simpler to just strip the least significant byte from the IP address and sidestep the whole hashing problem.
"These 2 identifiers will assist researchers in assessing if the crowdsourced data we are gathering is 'gamed' or 'believable'."
Another problem with such datasets is that outsiders may have trouble believing the dataset even if it has a plausible distribution of IP addresses and User-Agent strings (which they wouldn't be sure of because all you would be able to offer them would be opaque hashes). What's to stop the site admins themselves (hypothetically, of course) from faking the data while retaining the IP addresses and the User-Agents, for example?
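The two suggestions in this reply can be sketched with core modules only. This is a hedged illustration, not a vetted design: `truncate_ipv4` zeroes the least significant byte of an IPv4 address, and `pbkdf2_sha256_hex` is a hand-rolled single-block PBKDF2-HMAC-SHA256 built on `hmac_sha256` from `Digest::SHA`. Both helper names, the iteration count, the placeholder secret, and the choice to pass the address as the password and the secret as the salt are assumptions made for the example.

```perl
use strict;
use warnings;
use Digest::SHA qw(hmac_sha256);
use Socket qw(inet_pton inet_ntop AF_INET);

# Option 1: zero the least significant byte of an IPv4 address (keep the /24).
sub truncate_ipv4 {
    my ($ip) = @_;
    my $packed = inet_pton(AF_INET, $ip);
    return unless defined $packed;       # not a parsable IPv4 address
    substr($packed, 3, 1, "\0");         # overwrite the fourth byte
    return inet_ntop(AF_INET, $packed);
}

# Option 2: PBKDF2-HMAC-SHA256, first 32-byte output block only.
sub pbkdf2_sha256_hex {
    my ($password, $salt, $iterations) = @_;
    # Digest::SHA's hmac_* functions take the data first and the key last.
    my $u = hmac_sha256($salt . pack('N', 1), $password);
    my $t = $u;
    for (2 .. $iterations) {
        $u = hmac_sha256($u, $password);
        $t ^= $u;                        # bytewise XOR of two 32-byte strings
    }
    return unpack('H*', $t);
}

# Hypothetical usage: the secret here is a placeholder and would be loaded
# from somewhere outside the source tree in practice.
my $secret   = 'placeholder-secret-not-for-real-use';
my $ip_bytes = inet_pton(AF_INET, '1.2.3.4');
print truncate_ipv4('1.2.3.4'), "\n";
print pbkdf2_sha256_hex($ip_bytes, $secret, 100_000), "\n";
```

Raising the iteration count makes each guess proportionally more expensive for an attacker, at the price of the same extra work once per stored record. A maintained module for bcrypt, scrypt or Argon2 would be preferable to hand-written code in production.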