in reply to Contemplating some set comparison tasks

Using your sample dataset, this code finds 1187 (of the 1890) sources that cover the 3692056 keys in 61 seconds:

#! perl -slw use strict; use Data::Dump qw[ pp ]; my %sources; my( $key, $source ); while( <> ) { chomp; ( $key, $source ) = split( '\|' ); push @{ $sources{ $source } }, $key; } print 'Sources: ', scalar keys %sources; my @set; ## remove singly paired sources from hash to @set @{ $sources{ $source } } == 1 and push @set, $source and delete $sources{ $source } while $source = each %sources; print 'Sources: ', scalar keys %sources; print 'Set: ', scalar @set; ## index multiply paired sources by key my %keys; while( $source = each %sources ) { my $keysRef = $sources{ $source }; for( my $i = 0; $i < @{ $keysRef }; ++$i ) { push @{ $keys{ $keysRef->[ $i ] } }, $source; } } print 'Keys: ', scalar keys %keys; ## For each source, if every paired key is also paired with at least o +ne other source ## then this source is redundant. while( $source = each %sources ) { my $redundant = 1; my $keysRef = $sources{ $source }; KEY: for my $key ( @{ $keysRef } ) { for my $otherSource ( @{ $keys{ $key } } ) { next KEY if $otherSource ne $source; } $redundant = 0, last; } push @set, $source unless $redundant; } printf "Set contains %d sources\n", scalar @set; <STDIN>; pp \@set;

Outputs:

[10:46:45.31] C:\test>1096794.pl data_locks_keys.txt Sources: 1890 Sources: 1819 Set: 71 Keys: 3692056 Set contains 1187 sources ## hit enter here to dump those sources.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^2: Contemplating some set comparison tasks
by oiskuu (Hermit) on Aug 14, 2014 at 18:06 UTC

    Or the following input:

    aa|s1 bb|s1 bb|s2 cc|s2 cc|s3 aa|s3 cc|s4 dd|s4 aa|s4
    Yields a ["s4"]. How to interpret that result?

Re^2: Contemplating some set comparison tasks
by poj (Abbot) on Aug 14, 2014 at 17:45 UTC
    Using this tiny data set
    key|source a | a a | c b | c b | b
    gives a set containing 3 sources [a,b,c].
    Shouldn't the result show some reduction, either 2 [a,b] or 1 (the minimum)[c] ?
    poj