Apologies to everyone who tried to help. I was trying many iterations of the code (beating my head against the wall) and thought I had put the latest version up.

Now I'm trying to understand Laurent's short code, which uses an array of hashes rather than individual hashes:

    my %hist;
    while (<DATA>) {
        chomp;
        my ($col0, @element) = split;
        $hist{$col0}{$_}++ for @element;
    }

I'm looking to implement some simple set theory with statistics.

I want to get the keys that are unique to each set, i.e. each set minus its intersection with the other sets.

The node at http://www.perlmonks.org/?node=How%20can%20I%20get%20the%20unique%20keys%20from%20two%20hashes%3F gives the following code

    my %seen = ();
    for my $element (keys(%hist1), keys(%hist2)) {
        $seen{$element}++;
    }
    my @uniq = keys %seen;

which is why I thought it would be simpler to have separate hashes. There are elements in hist1 that are not in hist2 and vice versa. Is finding unique keys this way faster than subtracting the intersection from each set, i.e. A - (A int B)? At the moment I'm working with small sample data to debug, but I will have 12+ GB of data to process.

Re^4: Hashes, keys and multiple histogram
by Laurent_R (Canon) on Aug 18, 2014 at 07:08 UTC
    If you have a hash of hashes (and not an array of hashes), such as the one I showed in my second version of the program, you can use the code you showed (which finds the union, rather than the intersection, of the two sets, i.e. a list in which each key present in either set appears once) with the following small changes (I think it should be right, but I cannot test right now):
        my %seen = ();
        for my $element (keys(%{$hist{1}}), keys(%{$hist{2}})) {
            $seen{$element}++;
        }
        my @uniq = keys %seen;
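Since the code above yields the union, here is a minimal sketch of how the keys unique to each set (A minus B) and the intersection might be computed directly from a hash of hashes. The sample data and the helper names `only_in` and `in_both` are hypothetical and only serve the illustration; the structure of `%hist` is assumed to match the one built earlier in the thread.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the real %hist
# (outer key = set name, inner key = element, value = count).
my %hist = (
    1 => { a => 2, b => 1, c => 4 },
    2 => { b => 3, c => 1, d => 5 },
);

# Keys of the first hash ref that do not exist in the second (A minus B).
sub only_in {
    my ($set_a, $set_b) = @_;
    my @keys   = grep { !exists $set_b->{$_} } keys %$set_a;
    my @sorted = sort @keys;
    return @sorted;
}

# Keys that exist in both hash refs (A intersect B).
sub in_both {
    my ($set_a, $set_b) = @_;
    my @keys   = grep { exists $set_b->{$_} } keys %$set_a;
    my @sorted = sort @keys;
    return @sorted;
}

my @only_1 = only_in($hist{1}, $hist{2});
my @only_2 = only_in($hist{2}, $hist{1});
my @common = in_both($hist{1}, $hist{2});

print "only in 1: @only_1\n";    # only in 1: a
print "only in 2: @only_2\n";    # only in 2: d
print "in both:   @common\n";    # in both:   b c
```

Testing each key of A with `exists` against B gives A minus B in a single pass over A, so there is no need to build the intersection first and then subtract it.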
    Having said that, we might have another serious problem here. 12 GB is a lot of data, and it is far from certain that such a volume will fit into your computer's memory. In other words, you might not be able to store all your data in a hash. I am not talking about a Perl limitation, but about a limitation of your hardware.
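One thing that may soften the memory concern: when the file is read one line at a time and only counts are kept, the hash grows with the number of distinct (set, element) pairs rather than with the size of the file. Whether that fits in memory depends on how many distinct elements the real data contains, which is unknown here. A minimal sketch, assuming whitespace-separated lines with the set name in the first column; the sub name `build_hist` and the sample input are hypothetical:

```perl
use strict;
use warnings;

# Build the histogram from any filehandle, reading one line at a time
# so that only the counts are held in memory, never the whole file.
sub build_hist {
    my ($fh) = @_;
    my %hist;
    while (my $line = <$fh>) {
        chomp $line;
        my ($col0, @element) = split ' ', $line;
        next unless defined $col0;    # skip blank lines
        $hist{$col0}{$_}++ for @element;
    }
    return \%hist;
}

# Hypothetical sample input read through an in-memory filehandle;
# a real run would open the data file instead.
my $sample = "1 a b a\n2 b c\n1 c\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $hist = build_hist($fh);
close $fh;

# Count the distinct (set, element) pairs actually stored.
my $distinct = 0;
$distinct += scalar keys %$_ for values %$hist;
print "distinct (set, element) pairs: $distinct\n";
```

Running this on a small sample of the real data and watching how the pair count grows would give a rough idea of whether the full data set is likely to fit.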

      Laurent,

      Many thanks for the help, I'm running some benchmarks now.