I'm stewing on a particular task that is likely to reappear from time to time. I'd like to find an efficient way to do this work so it can scale up in future.

In summary, I have Keys and Sources. A Key may come from one or many Sources; a Source may generate one or more Keys. What is the minimal set of Sources that covers all the Keys?

I have data in the format "Key|Source", matching /^[0-9A-F]{40}\|[0-9a-f]{40}$/:

0000002D9D62AEBE1E0E9DB6C4C4C7C16A163D2C|2f214516cdcab089e83f3e5094928fe9611f2f51
000000A9E47BD385A0A3685AA12C2DB6FD727A20|2adeac692d450c54f8830014ee6cbe3a958c1e60
00000142988AFA836117B1B572FAE4713F200567|04bb7bbed62376f9aaec15fe6f18b89b27a4c3d8
00000142988AFA836117B1B572FAE4713F200567|6935a8fc967a6ffc20be0f07f2bb4a46072a397e
00000142988AFA836117B1B572FAE4713F200567|8c88f4f3c4b1aff760a026759ae807af6c40e015
00000142988AFA836117B1B572FAE4713F200567|974c820f53aded6d6e57ca8de2c33206e2b5f439
00000142988AFA836117B1B572FAE4713F200567|b05be3e17bb9987ffb368696ee916dd9b9c2f9b3
000001BCBC3B7C8C6E5FC59B686D3568132D218C|0d4c09539f42165bb8b1ab890fe6dc3d3ca838b3
000001BCBC3B7C8C6E5FC59B686D3568132D218C|9fd421d4e020788100c289d21e4b9297acaaff62
000001BCBC3B7C8C6E5FC59B686D3568132D218C|d09565280ebae0a37ca9385bc39c0a777a446554
000001E4975FA18878DF5C0989024327FBE1F4DF|55b8ece03f4935f9be667e332d52f7db3e17b809
000001EF1880189B7DE7C15E971105EB6707DE83|cd15550344b5b9c2785a13ef9583015f267ad667
000002F2D7CB4D4B548ADC623F559683D6F59258|36bed8bdb6d66fb67f409166f5db64b02199812f
0000034C9033333F8F58D9C7A64800F509962F3A|3c4b0a3c1acf6e03111805a0d8b4e879df112b7a
000003682106A4CB4F9D3B1B6E5C08820FCFD1B2|cd15550344b5b9c2785a13ef9583015f267ad667
00000368B9CFE1B4CF9F3D38F3EFD82840BA280D|50edd315b9217345b1728c38b02657df42043197
000003A16A5A1C6CCDDBE548E85261422489A458|691845459c0ad35b28cce4dffc0e3ee8912fb0f5
0000046FD530A338D03422C7D0D16A9EE087ECD9|13e213f346ce624e9be99b356ab9125af563a375
0000046FD530A338D03422C7D0D16A9EE087ECD9|67c0da2da88a23a803733cea951e84974b34d029
00000472E2B96A6CD0CBE8614779C5C8197BB42D|0c5e6cdb06c52160ded398d17392246269165e0a

I am now dealing with a set of 190,000,000+ Key|Source pairs. There are 30,000,000 unique Key values and 20,000 unique Source values. There are 23,800,000 Keys that appear only once, so I know their Sources must be in the final set I want. I need to find the smallest set of Sources that covers the 6,200,000 remaining Keys.

I can think of a brute-force iteration method to do this, but there should be a more elegant (and hopefully more efficient) way to find the smallest covering set of Sources over those 6,200,000 Keys.
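
My brute-force instinct is close to the classic greedy approximation: repeatedly take the source that covers the most still-uncovered keys. A rough sketch of that idea, assuming a source => [keys] hash fits in memory (untested at anything like my data size):

#!/usr/bin/perl
use strict;
use warnings;

# Greedy set-cover sketch: read Key|Source pairs on STDIN and print a
# (near-)minimal list of sources that together cover every key.
my %keys_of;                                   # source => { key => 1, ... }
while (<STDIN>) {
    chomp;
    my ($key, $source) = split /\|/;
    $keys_of{$source}{$key} = 1;
}

my %uncovered = map { %$_ } values %keys_of;   # every key starts uncovered
my @chosen;

while (%uncovered) {
    # pick the source that covers the most still-uncovered keys
    my ($best, $best_gain) = (undef, 0);
    for my $source (keys %keys_of) {
        my $gain = grep { exists $uncovered{$_} } keys %{ $keys_of{$source} };
        ($best, $best_gain) = ($source, $gain) if $gain > $best_gain;
    }
    last unless defined $best;                 # nothing left can help
    push @chosen, $best;
    delete @uncovered{ keys %{ $keys_of{$best} } };
    delete $keys_of{$best};
}

print scalar(@chosen), " sources chosen\n";
print "$_\n" for @chosen;

Greedy only approximates the true minimum (exact minimum set cover is NP-hard), but with 20,000 sources an approximation may be all that is practical.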

My data is usually sorted by Key value, which is how I'm used to thinking of it. If I sort it by Source value instead, I might have an inspiration.

So I'm stewing on this...

UPDATE 2014-08-12

I have shrunk the problem set by identifying the keys which come from only a single source. I am now left to consider only the set of many-to-many key-source relationships: 9,197,129 relations between 1,890 sources and 3,692,089 keys. I was able to reduce the key signatures from 40 characters to 11 and the source signatures from 40 characters to 6, to buy myself some space.
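
The shortening amounts to a collision check: a prefix is safe as long as no two signatures collide at that length, and 11 and 6 turned out to be collision-free for this data. Something along these lines would verify it (a sketch only; the hash names are made up and the unique values are assumed to already be in memory):

#!/usr/bin/perl
use strict;
use warnings;

# Return the shortest prefix length at which a list of (already unique)
# 40-char hex signatures stays collision-free.
sub shortest_unique_prefix {
    my @values = @_;
    LENGTH:
    for my $len (1 .. 40) {
        my %seen;
        for my $v (@values) {
            next LENGTH if $seen{ substr $v, 0, $len }++;   # collision, try longer
        }
        return $len;
    }
    return 40;                                              # fall back to the full signature
}

# usage sketch (hash names are hypothetical):
# my $key_len    = shortest_unique_prefix( keys %unique_keys );     # 11 for my data
# my $source_len = shortest_unique_prefix( keys %unique_sources );  # 6 for my data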

Off to bang on this a while...

Complete data (170 MB) in a 50 MB zip file: data_locks_keys.zip
format: key|source

Re: Contemplating some set comparison tasks
by oiskuu (Hermit) on Aug 08, 2014 at 22:22 UTC
Re: Contemplating some set comparison tasks
by erix (Prior) on Aug 09, 2014 at 11:23 UTC

    I'm not sure exactly what you want as a result, but perhaps you want DISTINCT ON.

    Slurping that list above into a table 'setcomp' with a single column 'c', you'd get something like:

    #!/bin/sh
    t=setcomp
    echo "
    -- drop table if exists $t;  -- enable if necessary
    create table $t as
    select * from
    (values
    ('0000002D9D62AEBE1E0E9DB6C4C4C7C16A163D2C|2f214516cdcab089e83f3e5094928fe9611f2f51'),
    ('000000A9E47BD385A0A3685AA12C2DB6FD727A20|2adeac692d450c54f8830014ee6cbe3a958c1e60'),
    ('00000142988AFA836117B1B572FAE4713F200567|04bb7bbed62376f9aaec15fe6f18b89b27a4c3d8'),
    ('00000142988AFA836117B1B572FAE4713F200567|6935a8fc967a6ffc20be0f07f2bb4a46072a397e'),
    ('00000142988AFA836117B1B572FAE4713F200567|8c88f4f3c4b1aff760a026759ae807af6c40e015'),
    ('00000142988AFA836117B1B572FAE4713F200567|974c820f53aded6d6e57ca8de2c33206e2b5f439'),
    ('00000142988AFA836117B1B572FAE4713F200567|b05be3e17bb9987ffb368696ee916dd9b9c2f9b3'),
    ('000001BCBC3B7C8C6E5FC59B686D3568132D218C|0d4c09539f42165bb8b1ab890fe6dc3d3ca838b3'),
    ('000001BCBC3B7C8C6E5FC59B686D3568132D218C|9fd421d4e020788100c289d21e4b9297acaaff62'),
    ('000001BCBC3B7C8C6E5FC59B686D3568132D218C|d09565280ebae0a37ca9385bc39c0a777a446554'),
    ('000001E4975FA18878DF5C0989024327FBE1F4DF|55b8ece03f4935f9be667e332d52f7db3e17b809'),
    ('000001EF1880189B7DE7C15E971105EB6707DE83|cd15550344b5b9c2785a13ef9583015f267ad667'),
    ('000002F2D7CB4D4B548ADC623F559683D6F59258|36bed8bdb6d66fb67f409166f5db64b02199812f'),
    ('0000034C9033333F8F58D9C7A64800F509962F3A|3c4b0a3c1acf6e03111805a0d8b4e879df112b7a'),
    ('000003682106A4CB4F9D3B1B6E5C08820FCFD1B2|cd15550344b5b9c2785a13ef9583015f267ad667'),
    ('00000368B9CFE1B4CF9F3D38F3EFD82840BA280D|50edd315b9217345b1728c38b02657df42043197'),
    ('000003A16A5A1C6CCDDBE548E85261422489A458|691845459c0ad35b28cce4dffc0e3ee8912fb0f5'),
    ('0000046FD530A338D03422C7D0D16A9EE087ECD9|13e213f346ce624e9be99b356ab9125af563a375'),
    ('0000046FD530A338D03422C7D0D16A9EE087ECD9|67c0da2da88a23a803733cea951e84974b34d029'),
    ('00000472E2B96A6CD0CBE8614779C5C8197BB42D|0c5e6cdb06c52160ded398d17392246269165e0a')
    ) as f(c)
    ;
    " | psql -qX
    echo "
    -- count all (20 rows)
    select count(*) from $t;

    -- show key count where more than 1:
    select key, count(*) from (
      select substring(c, 1, 40)  as key
           , substring(c, 42, 40) as source
      from $t
    ) as f
    group by key
    having count(*) > 1
    ;

    -- show source count where more than 1:
    select source, count(*) from (
      select substring(c, 1, 40)  as key
           , substring(c, 42, 40) as source
      from $t
    ) as f
    group by source
    having count(*) > 1
    ;

    -- distinct on key (13 rows)
    select distinct on (key) * from (
      select substring(c, 1, 40)  as key
           , substring(c, 42, 40) as source
      from $t
    ) as f
    ;
    " | psql -Xq

    ... with output:

    $ ./pm-1096794.sh
     count
    -------
        20
    (1 row)

                       key                    | count
    ------------------------------------------+-------
     0000046FD530A338D03422C7D0D16A9EE087ECD9 |     2
     000001BCBC3B7C8C6E5FC59B686D3568132D218C |     3
     00000142988AFA836117B1B572FAE4713F200567 |     5
    (3 rows)

                      source                  | count
    ------------------------------------------+-------
     cd15550344b5b9c2785a13ef9583015f267ad667 |     2
    (1 row)

                       key                    |                  source
    ------------------------------------------+------------------------------------------
     0000002D9D62AEBE1E0E9DB6C4C4C7C16A163D2C | 2f214516cdcab089e83f3e5094928fe9611f2f51
     000000A9E47BD385A0A3685AA12C2DB6FD727A20 | 2adeac692d450c54f8830014ee6cbe3a958c1e60
     00000142988AFA836117B1B572FAE4713F200567 | 04bb7bbed62376f9aaec15fe6f18b89b27a4c3d8
     000001BCBC3B7C8C6E5FC59B686D3568132D218C | 0d4c09539f42165bb8b1ab890fe6dc3d3ca838b3
     000001E4975FA18878DF5C0989024327FBE1F4DF | 55b8ece03f4935f9be667e332d52f7db3e17b809
     000001EF1880189B7DE7C15E971105EB6707DE83 | cd15550344b5b9c2785a13ef9583015f267ad667
     000002F2D7CB4D4B548ADC623F559683D6F59258 | 36bed8bdb6d66fb67f409166f5db64b02199812f
     0000034C9033333F8F58D9C7A64800F509962F3A | 3c4b0a3c1acf6e03111805a0d8b4e879df112b7a
     000003682106A4CB4F9D3B1B6E5C08820FCFD1B2 | cd15550344b5b9c2785a13ef9583015f267ad667
     00000368B9CFE1B4CF9F3D38F3EFD82840BA280D | 50edd315b9217345b1728c38b02657df42043197
     000003A16A5A1C6CCDDBE548E85261422489A458 | 691845459c0ad35b28cce4dffc0e3ee8912fb0f5
     0000046FD530A338D03422C7D0D16A9EE087ECD9 | 13e213f346ce624e9be99b356ab9125af563a375
     00000472E2B96A6CD0CBE8614779C5C8197BB42D | 0c5e6cdb06c52160ded398d17392246269165e0a
    (13 rows)

    (Perhaps you already have 'key' and 'source' in separate columns so that you don't need the subselect.)

    Is that last SQL statement the action you want performed (reducing the initial 20 rows to 13 unique key values)?

      They are in a database, in separate columns, and I am familiar with the DISTINCT function. Those get me pretty far, but my final destination would be to know the fewest number of distinct sources that contain all of the keys. I'm not sure if SQL has some set functions that can do this - and even if it does, that I have the right grasp of the problem to apply them. :-/

        DISTINCT is really different from DISTINCT ON. And to be honest you sound as if you do not realize that.

        DISTINCT is a general SQL feature that all RDBMS have.

        DISTINCT ON is a PostgreSQL-only extension. See my examples, and the SELECT manual page that I linked to upthread.

        So I still think that the last statement in my example (the statement that reduces your 20-row set to a 13-row set) does what you require. Is that not so?

        select
            -- this line is the DISTINCT ON clause:
            distinct on (key)
            -- the select list follows:
            key, source
        from yourtable
        -- often you'd want an order by clause
        -- to control which rows are discarded
        -- In this case it does not seem to be important
        ;
Re: Contemplating some set comparison tasks
by wjw (Priest) on Aug 08, 2014 at 19:04 UTC

    Honestly, my gut reaction is to use a database to do this, especially since you say you may have to repeat/scale in the future.

    Use Perl to get the values into a DB, then run queries to trim things down the way you want...

    Is a DB out of the question for some reason?

    ...the majority is always wrong, and always the last to know about it...

    Insanity: Doing the same thing over and over again and expecting different results...

    A solution is nothing more than a clearly stated problem...otherwise, the problem is not a problem, it is a fact

      It's in a PostgreSQL DB, actually. I'm less familiar with SQL - the SELECT DISTINCTs don't scare me, but I get a little spooked by GROUP BYs, and I fear doing some kind of JOIN of this table against itself.

      If I can come to grips with the concept of what I want to accomplish - and Perl is the mechanism I'm hoping will let me do that most easily - I ought to be able to work out the SQL solution, and then you're right, the DB engine probably would be best.

        You said you're looking for something efficient - for such a relatively large data set, having the DB do the work should almost always be more efficient than reading the data into Perl and doing the work there.

        If you're worried about getting the queries right, how about setting up some test cases? A few minimal sets of data for which you can figure out the desired output "by hand" for the various queries you want to run, and then work on the queries until they match the desired output.
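
        For example, something like this (a sketch only; it assumes DBD::Pg, a scratch database called 'test', and Test::More, and the fixture and expected rows are made up) pins a query's output to a hand-computed answer:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;
        use Test::More tests => 1;

        # Hand-sized fixture whose expected answer is obvious.
        my @pairs = ( [qw(aa s1)], [qw(bb s1)], [qw(bb s2)], [qw(cc s2)] );

        my $dbh = DBI->connect('dbi:Pg:dbname=test', '', '', { RaiseError => 1 });
        $dbh->do('create temporary table pairs (key text, source text)');
        my $ins = $dbh->prepare('insert into pairs (key, source) values (?, ?)');
        $ins->execute(@$_) for @pairs;

        # The query under test (DISTINCT ON is PostgreSQL-specific).
        my $got = $dbh->selectall_arrayref(
            'select distinct on (key) key, source from pairs order by key, source'
        );

        # Worked out by hand for the fixture above.
        my $expected = [ [qw(aa s1)], [qw(bb s1)], [qw(cc s2)] ];
        is_deeply($got, $expected, 'DISTINCT ON query matches the hand-computed rows');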

Re: Contemplating some set comparison tasks
by Laurent_R (Canon) on Aug 09, 2014 at 18:06 UTC
    My previous post had an error which would have made the solution impractical for large input data. This is my corrected attempt.

    Well, I do not know how well what I'll show here will work with your data, but I think it could considerably reduce the size of the data before you have to go for a brute-force or approximate heuristic solution.

    The idea is to read the data from the file (sorted by the keys) once, detect the keys that appear only once, and store in a %required_sources hash the sources corresponding to those keys. Since your data has 20,000 unique source values, the hash will have at most 20,000 entries, which is manageable (and even scalable to a certain extent).

    Next, you read the data from the file once more and remove all the keys corresponding to the sources of the keys that appear only once. This should give you a much smaller list of keys/sources (at most 30M - 23.8M = 6.2M, but probably much less, since those sources are likely to eliminate keys that do not appear only once as well). Among the remaining keys, it is quite likely that some new keys will now appear only once, so you can repeat the process and reduce the list further. You may be able to repeat this whole process several times, but at some point you will probably no longer be able to, because all remaining keys will appear more than once. Hopefully the data will by then be much, much smaller. At that point, you can either use a heuristic if you don't need the very best solution (for example, picking the source that comes up most often to further reduce the data, and then repeating the whole process), or, if you really need the very best solution, use brute force on the remaining data.

    In the small code example below, I stored the data in an array to make it easier to loop twice on it, but in the real situation, you would have to loop twice on the file. The program actually iterates on the subsequent filtering process only once for reasons that will become obvious below.

    use strict;
    use warnings;
    use Data::Dumper;

    my (%required_sources, %remaining_keys);
    my $count = 0;
    my ($previous_key, $previous_source) = ("", "");
    my @DATA = qw /
    0000002D9D62AEBE1E0E9DB6C4C4C7C16A163D2C|2f214516cdcab089e83f3e5094928fe9611f2f51
    000000A9E47BD385A0A3685AA12C2DB6FD727A20|2adeac692d450c54f8830014ee6cbe3a958c1e60
    00000142988AFA836117B1B572FAE4713F200567|04bb7bbed62376f9aaec15fe6f18b89b27a4c3d8
    00000142988AFA836117B1B572FAE4713F200567|6935a8fc967a6ffc20be0f07f2bb4a46072a397e
    00000142988AFA836117B1B572FAE4713F200567|8c88f4f3c4b1aff760a026759ae807af6c40e015
    00000142988AFA836117B1B572FAE4713F200567|974c820f53aded6d6e57ca8de2c33206e2b5f439
    00000142988AFA836117B1B572FAE4713F200567|b05be3e17bb9987ffb368696ee916dd9b9c2f9b3
    000001BCBC3B7C8C6E5FC59B686D3568132D218C|0d4c09539f42165bb8b1ab890fe6dc3d3ca838b3
    000001BCBC3B7C8C6E5FC59B686D3568132D218C|9fd421d4e020788100c289d21e4b9297acaaff62
    000001BCBC3B7C8C6E5FC59B686D3568132D218C|d09565280ebae0a37ca9385bc39c0a777a446554
    000001E4975FA18878DF5C0989024327FBE1F4DF|55b8ece03f4935f9be667e332d52f7db3e17b809
    000001EF1880189B7DE7C15E971105EB6707DE83|cd15550344b5b9c2785a13ef9583015f267ad667
    000002F2D7CB4D4B548ADC623F559683D6F59258|36bed8bdb6d66fb67f409166f5db64b02199812f
    0000034C9033333F8F58D9C7A64800F509962F3A|3c4b0a3c1acf6e03111805a0d8b4e879df112b7a
    000003682106A4CB4F9D3B1B6E5C08820FCFD1B2|cd15550344b5b9c2785a13ef9583015f267ad667
    00000368B9CFE1B4CF9F3D38F3EFD82840BA280D|50edd315b9217345b1728c38b02657df42043197
    000003A16A5A1C6CCDDBE548E85261422489A458|691845459c0ad35b28cce4dffc0e3ee8912fb0f5
    0000046FD530A338D03422C7D0D16A9EE087ECD9|13e213f346ce624e9be99b356ab9125af563a375
    0000046FD530A338D03422C7D0D16A9EE087ECD9|67c0da2da88a23a803733cea951e84974b34d029
    00000472E2B96A6CD0CBE8614779C5C8197BB42D|0c5e6cdb06c52160ded398d17392246269165e0a
    /;
    for (@DATA) {
        my ($key, $source) = split /\|/, $_;
        if ($key eq $previous_key) {
            $count ++;
        } else {
            $required_sources{$previous_source} = 1 if $count == 1;
            $previous_key = $key;
            $count = 1;
        }
        $previous_source = $source;
    }
    $required_sources{$previous_source} = 1 if $count == 1;
    for (@DATA) {
        my ($key, $source) = split /\|/, $_;
        $remaining_keys{$key} = $source unless exists $required_sources{$source};
    }
    print "Remaining keys (before the while loop): \n", Dumper \%remaining_keys;
    my $continue = 1;
    while ($continue and keys %remaining_keys) {
        my %unique_keys;
        $unique_keys{$_}++ for keys %remaining_keys;
        $continue = 0;
        for my $key (keys %unique_keys) {
            if ($unique_keys{$key} == 1) {
                my $source = $remaining_keys{$key};
                $required_sources{$source} = 1;
                delete $remaining_keys{$key};
                my @additional_keys_to_be_deleted =
                    grep { $remaining_keys{$_} eq $source } keys %remaining_keys;
                delete $remaining_keys{$_} for @additional_keys_to_be_deleted;
                $continue = 1;
            }
        }
    }
    print "Required sources : \n", Dumper \%required_sources;
    print "\n\n";
    print "Remaining keys: \n", Dumper \%remaining_keys;
    This produces the following output:
    $ perl key_source.pl
    Remaining keys (before the while loop):
    $VAR1 = {
              '0000046FD530A338D03422C7D0D16A9EE087ECD9' => '67c0da2da88a23a803733cea951e84974b34d029',
              '00000142988AFA836117B1B572FAE4713F200567' => 'b05be3e17bb9987ffb368696ee916dd9b9c2f9b3',
              '000001BCBC3B7C8C6E5FC59B686D3568132D218C' => 'd09565280ebae0a37ca9385bc39c0a777a446554'
            };
    Required sources :
    $VAR1 = {
              '50edd315b9217345b1728c38b02657df42043197' => 1,
              '36bed8bdb6d66fb67f409166f5db64b02199812f' => 1,
              '55b8ece03f4935f9be667e332d52f7db3e17b809' => 1,
              'b05be3e17bb9987ffb368696ee916dd9b9c2f9b3' => 1,
              '67c0da2da88a23a803733cea951e84974b34d029' => 1,
              '2adeac692d450c54f8830014ee6cbe3a958c1e60' => 1,
              'd09565280ebae0a37ca9385bc39c0a777a446554' => 1,
              '3c4b0a3c1acf6e03111805a0d8b4e879df112b7a' => 1,
              '691845459c0ad35b28cce4dffc0e3ee8912fb0f5' => 1,
              '0c5e6cdb06c52160ded398d17392246269165e0a' => 1,
              '2f214516cdcab089e83f3e5094928fe9611f2f51' => 1,
              'cd15550344b5b9c2785a13ef9583015f267ad667' => 1
            };


    Remaining keys:
    $VAR1 = {};
    As you can see, after having read the file twice, there are only three keys left in the data and they each appear only once, so the process is essentially finished; we only need to iterate once in the while loop to complete it.

    We would most probably not be so lucky with your actual data, but given that you have only 20k unique sources for 30m unique keys, there is a fair chance that this process iterated a few times will considerably reduce the size of the data.

    I am not convinced that a database would do better. The process above can be very fast; hash lookups are much faster than DB queries. At the very least, I would use this iteration process to drastically reduce the data size before perhaps trying to use a database for the final brute-force phase.

    This problem is very exciting to me; I love these types of problems. Thank you for submitting it. I wish I had the full data set to try out, to see how well (or badly) the approach suggested above works.

      Generate your sample data!

      #! /usr/bin/perl
      open(RND, '<', '/dev/urandom') or die;
      sub id { local $/ = \20; unpack 'H40', <RND> }  # random ID
      sub nk { int 1000**(1+rand 42)**-1 }            # shaped rnd
      my @S = map id, 1 .. 20_000;
      do { print "$_|$S[rand@S]\n" for (id) x nk } for 1 .. 30_000_000;

        Thanks for the suggestion. I had thought about it, but I quickly dismissed the idea because the efficiency of the suggested algorithm depends to a very large extent on the shape of the data. Randomly generated data might behave completely differently from the actual data, and we don't even know whether it would be for the better or for the worse.

        For example, the standard quicksort algorithm behaves well on randomly generated data (with an average O(n log n) complexity) but has a bad worst-case complexity (O(n²)); it behaves particularly poorly on almost-sorted data. Heap sort and merge sort behave reasonably well in all cases (O(n log n)). Other algorithms such as bubble sort or insertion sort perform poorly on random data (O(n²)) but exhibit an extremely good best case (O(n)).

        With the actual problem described by dwhite20899, I would probably first run the initial elimination phase I suggested (removing all singletons, i.e. keys that appear only once, plus all the keys having the same sources as those singletons, plus possibly duplicates of the keys eliminated because of those sources), and then look closely at the remaining data to figure out the best way to go. Depending on the actual shape of the data, there may still be 6.2 million lines at that point (if the elimination through the sources did not remove any additional keys), or maybe only 500K, or maybe far fewer. And the remaining data may have a lot of new singletons, making the second part of the algorithm efficient, or maybe not.

        In brief, using randomly generated data can't tell us anything about the performance of the suggested algorithm on the real data.

Re: Contemplating some set comparison tasks
by hexcoder (Curate) on Aug 09, 2014 at 10:29 UTC
    Hello,

    I have no idea how efficient databases are with respect to set covering, but if you want to try within Perl, there is a CPAN module using the 'greedy algorithm' (which gives a pretty good approximation of the optimal set cover).

      Interesting. I may use this on a subset and see if it can scale up. Thank you.
Re: Contemplating some set comparison tasks
by BrowserUk (Patriarch) on Aug 14, 2014 at 09:52 UTC

    Using your sample dataset, this code finds 1187 (of the 1890) sources that cover the 3692056 keys in 61 seconds:

    #! perl -slw
    use strict;
    use Data::Dump qw[ pp ];

    my %sources;
    my( $key, $source );
    while( <> ) {
        chomp;
        ( $key, $source ) = split( '\|' );
        push @{ $sources{ $source } }, $key;
    }
    print 'Sources: ', scalar keys %sources;

    my @set;

    ## remove singly paired sources from hash to @set
    @{ $sources{ $source } } == 1
        and push @set, $source
        and delete $sources{ $source }
            while $source = each %sources;

    print 'Sources: ', scalar keys %sources;
    print 'Set: ', scalar @set;

    ## index multiply paired sources by key
    my %keys;
    while( $source = each %sources ) {
        my $keysRef = $sources{ $source };
        for( my $i = 0; $i < @{ $keysRef }; ++$i ) {
            push @{ $keys{ $keysRef->[ $i ] } }, $source;
        }
    }
    print 'Keys: ', scalar keys %keys;

    ## For each source, if every paired key is also paired with at least one other source
    ## then this source is redundant.
    while( $source = each %sources ) {
        my $redundant = 1;
        my $keysRef = $sources{ $source };
        KEY: for my $key ( @{ $keysRef } ) {
            for my $otherSource ( @{ $keys{ $key } } ) {
                next KEY if $otherSource ne $source;
            }
            $redundant = 0, last;
        }
        push @set, $source unless $redundant;
    }
    printf "Set contains %d sources\n", scalar @set;
    <STDIN>;
    pp \@set;

    Outputs:

    [10:46:45.31] C:\test>1096794.pl data_locks_keys.txt
    Sources: 1890
    Sources: 1819
    Set: 71
    Keys: 3692056
    Set contains 1187 sources
    ## hit enter here to dump those sources.

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Or the following input:

      aa|s1
      bb|s1
      bb|s2
      cc|s2
      cc|s3
      aa|s3
      cc|s4
      dd|s4
      aa|s4
      Yields ["s4"]. How to interpret that result?
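
      For this input, a quick exhaustive check (a sketch; only feasible at toy sizes) says the true minimum is two sources: dd appears only with s4, and bb then still needs s1 or s2, so a one-element result cannot be a full cover.

      #!/usr/bin/perl
      use strict;
      use warnings;

      my %cover;                                     # source => { key => 1 }
      while (<DATA>) {
          chomp;
          my ($key, $source) = split /\|/;
          $cover{$source}{$key} = 1;
      }
      my %all_keys = map { %$_ } values %cover;
      my @sources  = sort keys %cover;

      my @best;
      for my $mask (1 .. 2**@sources - 1) {          # every non-empty subset of sources
          my @subset = grep { $mask & (1 << $_) } 0 .. $#sources;
          next if @best && @subset >= @best;
          my %covered;
          $covered{$_} = 1 for map { keys %{ $cover{ $sources[$_] } } } @subset;
          my $complete = 1;
          $covered{$_} or $complete = 0 for keys %all_keys;
          @best = map { $sources[$_] } @subset if $complete;
      }
      print "minimum cover: @best\n";                # here: s4 plus s1 (or s2)

      __DATA__
      aa|s1
      bb|s1
      bb|s2
      cc|s2
      cc|s3
      aa|s3
      cc|s4
      dd|s4
      aa|s4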

      Using this tiny data set
      key|source
      a  | a
      a  | c
      b  | c
      b  | b
      gives a set containing 3 sources [a,b,c].
      Shouldn't the result show some reduction, either 2 [a,b] or 1 (the minimum) [c]?
      poj
Re: Contemplating some set comparison tasks
by Laurent_R (Canon) on Aug 09, 2014 at 15:36 UTC
    Well, I do not know how well what I'll show here will work with your data, but I think it could considerably reduce the size of the data before you have to go for a brute-force solution.

    The idea is to read the data from the file once and store in a hash the keys, the first source found for each key, and the number of occurrences of that key. Since your data has 20,000 unique source values, the hash will have at most 20,000 entries, which is manageable (and even scalable to a certain extent).

    Update: The paragraph I crossed out above is wrong. Therefore my solution, which worked on your small data, would not work for very large data sets. I have therefore removed it from this post and will come back with a better one a bit later. I guess it is not a problem to remove the content, since nobody seems to have seen it (it had been there for only about 20 minutes).

Re: Contemplating some set comparison tasks
by erix (Prior) on Aug 13, 2014 at 19:49 UTC

    Here is a bash program that loads your sample data into a table 'data_locks_keys' and then from that table extracts the 'single' values of column 'key' (with the first associated value of 'source') into a derived table 'data_locks_keys_distinct'.

    Takes 1m 30s on 9.5dev postgres (low-end desktop).

    #!/bin/sh

    # wget http://white-barn.com/tmp/data_locks_keys.zip

    time unzip -p data_locks_keys.zip \
      | psql -c "
    drop table if exists data_locks_keys;
    create table data_locks_keys(key text, source text);
    copy data_locks_keys from stdin with ( format csv, delimiter E'|', header FALSE );
    "
    # main table data_locks_keys now has 9,197,129 rows

    echo "
    create table data_locks_keys_distinct as
      select distinct on (key) key, source
      from data_locks_keys
    ;
    " | psql
    # derived table data_locks_keys_distinct has 3,692,089 rows (with column 'key' now unique)

    I'm really getting curious whether this DISTINCT ON statement isn't what you are looking for...