in reply to Contemplating some set comparison tasks

My previous post had an error which would have made the solution impractical for large input data. This is my corrected attempt.

Well, I do not know how well what I'll show will work with your data, but I think it could considerably reduce the size of the data before you have to fall back on a brute-force or approximate heuristic solution.

The idea is to read the data from the file (sorted by the keys) once, detect the keys that appear only once, and store the sources corresponding to those keys in a %required_sources hash. Since your data has 20,000 unique source values, the hash will have at most 20,000 entries, which is manageable (and even scalable to a certain extent).

Next, you read the data once more from the file and remove every key whose source is already in %required_sources (i.e. the sources of the keys that appear only once). This should give you a much smaller list of keys/sources (at most 30m - 23.8m = 6.2m, but probably much less, since those sources are likely to eliminate keys that do not appear only once). Among the remaining keys, it is quite likely that some new keys will now appear only once, so you can repeat the process and further reduce the list. You may be able to repeat this whole process several times, but at some point you will probably no longer be able to, because all the remaining keys will appear more than once. Hopefully the data will then be much, much smaller. At that point, you can either use a heuristic if you don't need the very best solution (for example, picking the source that covers the most remaining keys to further reduce the data, and then repeating the whole process), or, if you really need the very best solution, use brute force on the remaining data.
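
To illustrate the greedy heuristic mentioned just above, here is a minimal sketch of the idea in isolation (my own illustration, not part of the program below; it assumes the remaining keys have been loaded into a hash mapping each key to an array ref of its sources, a structure I am making up for the example; in practice you would alternate this with the singleton elimination):

use strict;
use warnings;

# Greedy step: repeatedly pick the source that still covers the most
# remaining keys, mark it as required, and drop every key it covers.
sub greedy_reduce {
    my ($sources_of_key, $required_sources) = @_;
    while (%$sources_of_key) {
        # Count how many remaining keys each source would cover.
        my %coverage;
        for my $key (keys %$sources_of_key) {
            $coverage{$_}++ for @{ $sources_of_key->{$key} };
        }
        # Pick the most frequently occurring source.
        my ($best) = sort { $coverage{$b} <=> $coverage{$a} } keys %coverage;
        $required_sources->{$best} = 1;
        # Remove every key covered by that source.
        for my $key (keys %$sources_of_key) {
            delete $sources_of_key->{$key}
                if grep { $_ eq $best } @{ $sources_of_key->{$key} };
        }
    }
}

# Example with made-up data:
my %sources_of_key = (
    keyA => [ 'src1', 'src2' ],
    keyB => [ 'src2' ],
    keyC => [ 'src3' ],
);
my %required_sources;
greedy_reduce(\%sources_of_key, \%required_sources);
# %required_sources now contains src2 and src3.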

In the small code example below, I stored the data in an array to make it easier to loop over it twice, but in the real situation, you would loop twice over the file. The program ends up running the subsequent filtering loop only once, for reasons that will become obvious below.

use strict;
use warnings;
use Data::Dumper;

my (%required_sources, %remaining_keys);
my $count = 0;
my ($previous_key, $previous_source) = ("", "");

my @DATA = qw/
0000002D9D62AEBE1E0E9DB6C4C4C7C16A163D2C|2f214516cdcab089e83f3e5094928fe9611f2f51
000000A9E47BD385A0A3685AA12C2DB6FD727A20|2adeac692d450c54f8830014ee6cbe3a958c1e60
00000142988AFA836117B1B572FAE4713F200567|04bb7bbed62376f9aaec15fe6f18b89b27a4c3d8
00000142988AFA836117B1B572FAE4713F200567|6935a8fc967a6ffc20be0f07f2bb4a46072a397e
00000142988AFA836117B1B572FAE4713F200567|8c88f4f3c4b1aff760a026759ae807af6c40e015
00000142988AFA836117B1B572FAE4713F200567|974c820f53aded6d6e57ca8de2c33206e2b5f439
00000142988AFA836117B1B572FAE4713F200567|b05be3e17bb9987ffb368696ee916dd9b9c2f9b3
000001BCBC3B7C8C6E5FC59B686D3568132D218C|0d4c09539f42165bb8b1ab890fe6dc3d3ca838b3
000001BCBC3B7C8C6E5FC59B686D3568132D218C|9fd421d4e020788100c289d21e4b9297acaaff62
000001BCBC3B7C8C6E5FC59B686D3568132D218C|d09565280ebae0a37ca9385bc39c0a777a446554
000001E4975FA18878DF5C0989024327FBE1F4DF|55b8ece03f4935f9be667e332d52f7db3e17b809
000001EF1880189B7DE7C15E971105EB6707DE83|cd15550344b5b9c2785a13ef9583015f267ad667
000002F2D7CB4D4B548ADC623F559683D6F59258|36bed8bdb6d66fb67f409166f5db64b02199812f
0000034C9033333F8F58D9C7A64800F509962F3A|3c4b0a3c1acf6e03111805a0d8b4e879df112b7a
000003682106A4CB4F9D3B1B6E5C08820FCFD1B2|cd15550344b5b9c2785a13ef9583015f267ad667
00000368B9CFE1B4CF9F3D38F3EFD82840BA280D|50edd315b9217345b1728c38b02657df42043197
000003A16A5A1C6CCDDBE548E85261422489A458|691845459c0ad35b28cce4dffc0e3ee8912fb0f5
0000046FD530A338D03422C7D0D16A9EE087ECD9|13e213f346ce624e9be99b356ab9125af563a375
0000046FD530A338D03422C7D0D16A9EE087ECD9|67c0da2da88a23a803733cea951e84974b34d029
00000472E2B96A6CD0CBE8614779C5C8197BB42D|0c5e6cdb06c52160ded398d17392246269165e0a
/;

# First pass: the data is sorted by key, so a key that appears only once
# is a run of length 1; record the source of each such key.
for (@DATA) {
    my ($key, $source) = split /\|/, $_;
    if ($key eq $previous_key) {
        $count++;
    }
    else {
        $required_sources{$previous_source} = 1 if $count == 1;
        $previous_key = $key;
        $count = 1;
    }
    $previous_source = $source;
}
$required_sources{$previous_source} = 1 if $count == 1;

# Second pass: keep only the keys whose source is not already required.
for (@DATA) {
    my ($key, $source) = split /\|/, $_;
    $remaining_keys{$key} = $source unless exists $required_sources{$source};
}
print "Remaining keys (before the while loop): \n", Dumper \%remaining_keys;

# Further reduction: while something changed, look for remaining keys that
# appear only once, mark their source as required and delete every key
# covered by that source.
my $continue = 1;
while ($continue and keys %remaining_keys) {
    my %unique_keys;
    $unique_keys{$_}++ for keys %remaining_keys;
    $continue = 0;
    for my $key (keys %unique_keys) {
        if ($unique_keys{$key} == 1) {
            my $source = $remaining_keys{$key};
            $required_sources{$source} = 1;
            delete $remaining_keys{$key};
            my @additional_keys_to_be_deleted =
                grep { $remaining_keys{$_} eq $source } keys %remaining_keys;
            delete $remaining_keys{$_} for @additional_keys_to_be_deleted;
            $continue = 1;
        }
    }
}
print "Required sources : \n", Dumper \%required_sources;
print "\n\n";
print "Remaining keys: \n", Dumper \%remaining_keys;
This produces the following output:
$ perl key_source.pl
Remaining keys (before the while loop):
$VAR1 = {
          '0000046FD530A338D03422C7D0D16A9EE087ECD9' => '67c0da2da88a23a803733cea951e84974b34d029',
          '00000142988AFA836117B1B572FAE4713F200567' => 'b05be3e17bb9987ffb368696ee916dd9b9c2f9b3',
          '000001BCBC3B7C8C6E5FC59B686D3568132D218C' => 'd09565280ebae0a37ca9385bc39c0a777a446554'
        };
Required sources :
$VAR1 = {
          '50edd315b9217345b1728c38b02657df42043197' => 1,
          '36bed8bdb6d66fb67f409166f5db64b02199812f' => 1,
          '55b8ece03f4935f9be667e332d52f7db3e17b809' => 1,
          'b05be3e17bb9987ffb368696ee916dd9b9c2f9b3' => 1,
          '67c0da2da88a23a803733cea951e84974b34d029' => 1,
          '2adeac692d450c54f8830014ee6cbe3a958c1e60' => 1,
          'd09565280ebae0a37ca9385bc39c0a777a446554' => 1,
          '3c4b0a3c1acf6e03111805a0d8b4e879df112b7a' => 1,
          '691845459c0ad35b28cce4dffc0e3ee8912fb0f5' => 1,
          '0c5e6cdb06c52160ded398d17392246269165e0a' => 1,
          '2f214516cdcab089e83f3e5094928fe9611f2f51' => 1,
          'cd15550344b5b9c2785a13ef9583015f267ad667' => 1
        };


Remaining keys:
$VAR1 = {};
As you can see, after the two initial passes over the data, only three keys are left, and each of them appears only once, so the process is essentially finished: a single iteration of the while loop completes it.
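
For the real 30-million-line file, the same two passes can read from the file directly instead of an in-memory array. Here is a minimal sketch of what that might look like, assuming the input (which I call keys_sources.txt here, a made-up name) is sorted by key with one key|source pair per line; the surviving lines are written to a new file so that the whole process can be repeated on the smaller file:

use strict;
use warnings;

my %required_sources;
my $file = 'keys_sources.txt';   # hypothetical file name, sorted by key

# First pass over the file: record the source of every key seen exactly once.
open my $fh, '<', $file or die "Cannot open $file: $!";
my ($previous_key, $previous_source, $count) = ("", "", 0);
while (<$fh>) {
    chomp;
    my ($key, $source) = split /\|/, $_;
    if ($key eq $previous_key) {
        $count++;
    }
    else {
        $required_sources{$previous_source} = 1 if $count == 1;
        $previous_key = $key;
        $count = 1;
    }
    $previous_source = $source;
}
$required_sources{$previous_source} = 1 if $count == 1;
close $fh;

# Second pass: write out only the lines whose source is not yet required.
open $fh, '<', $file or die "Cannot reopen $file: $!";
open my $out, '>', 'remaining.txt' or die "Cannot write remaining.txt: $!";
while (<$fh>) {
    my ($key, $source) = split /\|/, $_;
    chomp $source;
    print $out $_ unless exists $required_sources{$source};
}
close $out;
close $fh;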

We would most probably not be so lucky with your actual data, but given that you have only 20k unique sources for 30m unique keys, there is a fair chance that this process iterated a few times will considerably reduce the size of the data.

I am not convinced that a database would do better. The process above can be very fast: hash lookups are much, much faster than DB queries. At the very least, I would use this iteration process to drastically reduce the data size before perhaps trying to use a database for the final brute force phase.
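
To give a rough idea of the difference, here is a small benchmark sketch (my own illustration, assuming DBD::SQLite is installed; the table layout and data are made up) comparing a plain hash lookup with a prepared query against an in-memory SQLite table:

use strict;
use warnings;
use Benchmark qw(cmpthese);
use DBI;

# Build the same key => source mapping in a hash and in an in-memory
# SQLite table, then compare lookup speed on 1,000 probe keys.
my %lookup = map { sprintf('%040x', $_) => sprintf('%040x', $_ * 7) } 1 .. 10_000;

my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '', { RaiseError => 1 });
$dbh->do('CREATE TABLE pairs (k TEXT PRIMARY KEY, source TEXT)');
my $ins = $dbh->prepare('INSERT INTO pairs (k, source) VALUES (?, ?)');
$dbh->begin_work;
$ins->execute($_, $lookup{$_}) for keys %lookup;
$dbh->commit;

my @probe = (keys %lookup)[0 .. 999];
my $sel   = $dbh->prepare('SELECT source FROM pairs WHERE k = ?');

cmpthese(-2, {
    hash   => sub { my $v; $v = $lookup{$_} for @probe },
    sqlite => sub {
        for my $k (@probe) {
            $sel->execute($k);
            my ($v) = $sel->fetchrow_array;
        }
    },
});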

This problem is very exciting to me, I love these types of problems, thank you for submitting it. I wish I had the full data set to try out and see how well (or badly) the approach suggested above works.

Re^2: Contemplating some set comparison tasks
by Anonymous Monk on Aug 10, 2014 at 01:04 UTC

    Generate your sample data!

    #! /usr/bin/perl
    open(RND, '<', '/dev/urandom') or die;
    sub id { local $/ = \20; unpack 'H40', <RND> }   # random ID
    sub nk { int 1000**(1+rand 42)**-1 }             # shaped rnd
    my @S = map id, 1 .. 20_000;
    do { print "$_|$S[rand@S]\n" for (id) x nk } for 1 .. 30_000_000;
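
    One practical note (my addition, not part of the generator above): the pairs are printed in random key order, while the singleton pass assumes input sorted by key, so the output would have to be sorted first. A minimal sketch with made-up file names (for the full 30M lines an external sort would be more memory-friendly):

    use strict;
    use warnings;

    # Sort the generated key|source lines lexically, which groups identical
    # keys together since the key is the first field of each line.
    open my $in, '<', 'generated.txt' or die "Cannot open generated.txt: $!";
    my @lines = <$in>;
    close $in;

    open my $out, '>', 'generated_sorted.txt' or die "Cannot write: $!";
    print {$out} sort @lines;
    close $out;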

      Thanks for the suggestion. I had thought about it, but I quickly dismissed the idea because the efficiency of the suggested algorithm depends to a very large extent on the shape of the data. Randomly generated data might behave completely differently from actual data, and we don't even know whether it would be for the better or for the worse.

      For example, the standard quicksort algorithm behaves well on randomly generated data (with an average O(n log n) complexity) but has a bad worst-case complexity (O(n²)); it behaves particularly poorly on almost-sorted data. Heap sort and merge sort behave reasonably well in all cases (O(n log n)). Other algorithms such as bubble sort or insertion sort perform poorly on random data (O(n²)) but exhibit an extremely good best case (O(n)).
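
      As a small illustration of that quicksort behaviour (a toy sketch I am adding here, unrelated to the set problem itself), a naive quicksort that always picks the first element as pivot degrades badly on already-sorted input:

      use strict;
      use warnings;
      no warnings 'recursion';
      use Benchmark qw(timethese);

      # Naive quicksort, first element as pivot: average O(n log n) on
      # random input, O(n^2) and deep recursion on sorted input.
      sub naive_quicksort {
          return @_ if @_ <= 1;
          my ($pivot, @rest) = @_;
          return (
              naive_quicksort(grep { $_ <  $pivot } @rest),
              $pivot,
              naive_quicksort(grep { $_ >= $pivot } @rest),
          );
      }

      my @random = map { int rand 1_000_000 } 1 .. 2_000;
      my @sorted = sort { $a <=> $b } @random;

      timethese(20, {
          random_input => sub { my @r = naive_quicksort(@random) },
          sorted_input => sub { my @r = naive_quicksort(@sorted) },
      });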

      With the actual problem described by dwhite20899, I would probably first run the initial elimination phase I suggested (removing all singletons, i.e. keys that appear only once, plus all the keys having the same source as those singletons, plus possibly duplicates of the keys eliminated because of their source), and then look closely at the remaining data to figure out the best way to go. Depending on the actual data shape, there may still be 6.2 million lines at this point (if the elimination through the sources did not remove any additional keys), or maybe only 500k, or maybe even much, much less. And the remaining data may have a lot of new singletons, making the second part of the algorithm efficient, or maybe not.

      In brief, using randomly generated data can't tell us anything about the performance of the suggested algorithm on the real data.