
Hello Eily. I should have provided a better explanation of how the large hash tables are created. I'm using the Splunk API to pull large datasets overnight; however, it is impossible to pull all of the information we need from Splunk in one query. In addition, the Splunk administrators have imposed limits on the amount of data that can be pulled and on the number of daily searches any single user can run. I know, rather draconian. I therefore gather all of the raw data in three separate pulls from Splunk and make extensive use of Perl to parse the raw flat files into usable .csv files.

I then ingest the three large .csv files and put them into three hash tables (takes about six minutes), hoping to enable efficient grouping into a single hash table, with each key associated with three values.

I am stumped.

Again:

hash_1 (input):   key => [value_hash_1]
hash_2 (input):   key => [value_hash_2]
hash_3 (input):   key => [value_hash_3]

hash_4 (output):  key => [value_hash_1, value_hash_2, value_hash_3]
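In Perl terms, the output structure I'm after would look something like this (the key and value names here are purely illustrative, not my real data):

    # desired %hash_4: each key maps to the three values
    # gathered from the three input hashes
    my %hash_4 = (
        key_A => [ 'value_1_A', 'value_2_A', 'value_3_A' ],
        key_B => [ 'value_1_B', 'value_2_B', 'value_3_B' ],
    );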

Re^3: multiple hash compare, find, create
by Eily (Monsignor) on Dec 10, 2018 at 23:57 UTC

    I don't know how time-sensitive your project is, but even if you can afford to let your program run for 10 minutes, it's always nice to have something faster than that :D. You can actually speed things up a lot by *not* building the three hashes. johngg had a good point when he said that you should base your search on the smallest hash. Going even further, you only need to build that one. This would give something like the following (that's pseudo-Perl, but you should get the idea):

        my %h1 = $csv1->csv_to_hash; # with $csv1 the smallest file

        my %temp;
        while (my $line = $csv2->next) {
            my $key = $line->key;
            # Ignore lines that don't also exist in %h1
            # That's way less data to build into the second hash
            next unless exists $h1{$key};
            my $value = $line->value;
            $temp{$key} = [ $h1{$key}, $value ];
        }
        %h1 = (); # Free the data from %h1, the matching pairs are all in %temp anyway

        my %out;
        while (my $line = $csv3->next) {
            my $key = $line->key;
            # As before, ignore non-relevant keys
            next unless exists $temp{$key};
            # We copy the array ref from %temp
            # This means that technically modifying it also changes the content of %temp
            # But it will be deleted anyway
            my $values_array = $temp{$key};
            my $value = $line->value;
            push @$values_array, $value;
            # Add the values into a brand new hash
            # so that it contains only the needed keys
            $out{$key} = $values_array;
        }
        %temp = (); # The important keys have been copied to %out

    You should also consider using Text::CSV if you're not already doing so :)
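    As a rough illustration (I'm assuming a two-column file with the key in the first column and the value in the second, and a placeholder file name; adjust to your real layout), reading one of the files into a hash with Text::CSV might look like:

        use strict;
        use warnings;
        use Text::CSV;

        my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
        open my $fh, '<:encoding(utf8)', 'input1.csv' or die "input1.csv: $!";
        my %h1;
        while (my $row = $csv->getline($fh)) {
            my ($key, $value) = @$row;   # assumed two-column layout
            $h1{$key} = $value;
        }
        close $fh;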

    Edit: oh, and I know my own example does exactly this, but you should avoid variable names that are identical except for a number as much as possible. So maybe rename %h1 to %ref, if you don't already have better (distinct) names available for your hashes.

Re^3: multiple hash compare, find, create
by AnomalousMonk (Archbishop) on Dec 10, 2018 at 22:07 UTC
    I then ingest the three large .csv files and put them into three hash tables (takes about six minutes) ...

    I am stumped.

    If you can fit your three hash tables into memory (and the quoted statement says you're doing just that), then I don't see why Eily's approach here (and johngg's identical approach) would present any implementation problem. The only stumbling block I can see is that the output hash might not also fit into memory. In this case, common key/value-set records could be written out to a file, perhaps a .csv file, for later processing.
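    A minimal sketch of that fallback, assuming the three input hashes (%h1, %h2, %h3 here, with plain scalar values) are already loaded, streaming each common record straight to disk with Text::CSV instead of accumulating a fourth hash (file name and layout are placeholders):

        use strict;
        use warnings;
        use Text::CSV;

        my (%h1, %h2, %h3);   # assume these were loaded from the three inputs
        my $csv = Text::CSV->new({ binary => 1, auto_diag => 1, eol => "\n" });
        open my $out, '>:encoding(utf8)', 'merged.csv' or die "merged.csv: $!";
        for my $key (keys %h1) {          # iterate over the smallest hash
            next unless exists $h2{$key} && exists $h3{$key};
            # One record per common key; nothing accumulates in memory
            $csv->say($out, [ $key, $h1{$key}, $h2{$key}, $h3{$key} ]);
        }
        close $out;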

    If the problem you're facing is that the input data is so large that even one input hash won't fit into memory, then an approach of doing an "external" sort of each input file and then merging the sorted input files with a Perl script suggests itself. This approach scales well to input files of enormous size, far larger than anything that could be accommodated in system RAM, and large numbers of input files, and can still be executed in minutes (in most cases) rather than hours.
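    Roughly, the merge step might look like the sketch below. I'm assuming each file has already been sorted on its key (e.g. with the system sort utility) and has a simple key,value layout, which is almost certainly not your real format; the point is that only three records are ever held in memory at once:

        use strict;
        use warnings;

        # Open the three pre-sorted key,value files (names are placeholders).
        my @fhs = map { open my $fh, '<', $_ or die "$_: $!"; $fh }
                  qw(sorted1.csv sorted2.csv sorted3.csv);

        # Current record ([key, value]) from each stream.
        my @cur = map { read_rec($_) } @fhs;

        # Stop as soon as any stream is exhausted: a key missing from one
        # file cannot be common to all three.
        while (not grep { not defined } @cur) {
            # Find the largest of the three current keys.
            my ($max) = sort { $b cmp $a } map { $_->[0] } @cur;
            if (grep { $_->[0] ne $max } @cur) {
                # Keys differ: advance every stream that is behind.
                for my $i (0 .. $#cur) {
                    $cur[$i] = read_rec($fhs[$i]) if $cur[$i][0] lt $max;
                }
            }
            else {
                # All three keys match: emit the merged record, advance all.
                print join(',', $max, map { $_->[1] } @cur), "\n";
                @cur = map { read_rec($_) } @fhs;
            }
        }

        sub read_rec {
            my ($fh) = @_;
            defined(my $line = readline $fh) or return undef;
            chomp $line;
            return [ split /,/, $line, 2 ];   # key, value
        }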

    hash_1 (input):   key => [value_hash_1]

    If this is the structure of your input hashes, I don't see why you're going to the extra (and superfluous) effort of putting each value into an anonymous array, only to take it out again when you create the output hash value. In what you've shown us so far, each input hash key has only a single value; why stick it into an array?
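    In other words (assuming plain scalar values throughout):

        # Store the value itself ...
        $hash_1{$key} = $value;
        # ... rather than a one-element anonymous array that you must
        # dereference again later:
        $hash_1{$key} = [ $value ];    # superfluous wrapping

    Then building the output is a single assignment per key: $hash_4{$key} = [ $hash_1{$key}, $hash_2{$key}, $hash_3{$key} ];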


    Give a man a fish:  <%-{-{-{-<