Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
Hello Perl Monks. First-time post, long-time gleaner of information, which has been very much appreciated. Anyway, I have an issue that I truly need some assistance with. I am fairly new to hash tables. What I need to do is compare three hash tables, each containing between 14 million and 28 million keys with associated values. Sure, I could nest some loops, let it rip, and come back in a year or so to see if it worked, hoping that no server crashes occurred.
Optimally, I'd like to find the common keys among the three hash tables, gather the values from each, and create a new, fourth hash table that contains key => [value_hash_1, value_hash_2, value_hash_3]. For a key that does not exist in all three hashes, do nothing; the resulting fourth hash table will have somewhat fewer than 14 million k/v pairs.
Do I make sense? And the solution is most likely posted, but I had difficulty finding it.
Thank you very much for your assistance!
Replies are listed 'Best First'.
Re: multiple hash compare, find, create
by johngg (Canon) on Dec 10, 2018 at 18:55 UTC
I would choose the smallest of the three hashes and iterate over its keys, doing an exists on the same key in the other two hashes and moving on to the next key unless it exists in both of the others. If it is in all three, add an array ref under that key in the fourth hash. Something like:-
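The code johngg attached did not survive in this copy of the thread; a minimal sketch of the approach he describes, with toy hashes standing in for the real 14-million-key data, might look like:

```perl
use strict;
use warnings;

# Toy inputs; %h1 is assumed to be the smallest of the three hashes.
my %h1 = ( a => 1,   b => 2,   c => 3 );
my %h2 = ( a => 10,  c => 30,  d => 40 );
my %h3 = ( a => 100, b => 200, c => 300 );

# Iterate over the smallest hash, skipping any key not in both others.
my %h4;
for my $key ( keys %h1 ) {
    next unless exists $h2{$key} && exists $h3{$key};
    $h4{$key} = [ $h1{$key}, $h2{$key}, $h3{$key} ];
}
# %h4 now holds only the common keys: a => [1,10,100], c => [3,30,300]
```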
I hope this is helpful. Cheers, JohnGG
Re: multiple hash compare, find, create
by Eily (Monsignor) on Dec 10, 2018 at 18:45 UTC
How do you get such big hashes in the first place? Searching for duplicates shouldn't take much longer than building the hashes (unless done horribly wrong), so if you can build the hashes in a matter of minutes, finding the triplets should also take minutes, not years. Maybe you are actually using tied hashes of some sort, like a database?
by supertaco (Novice) on Dec 10, 2018 at 19:06 UTC
Hello Eily. I should have provided a better explanation of how the large hash tables are created. I'm using the Splunk API to pull large datasets overnight. However, it is impossible to pull all of the information we need from Splunk in one query. In addition, the Splunk administrators have imposed limits on the amount of data that can be pulled and on the number of daily searches any single user can run. I know, rather draconian. Therefore I gather all of the raw data in three separate pulls from Splunk and make extensive use of Perl to parse the raw flat files into usable .csv files. I then ingest the three large .csv files and put them into the three hash tables (takes about 6 minutes), hoping to enable efficient grouping into a single hash table in which each key is associated with three values. I am stumped. Again:

hash_1 (input): key => [value_hash_1]
hash_2 (input): key => [value_hash_2]
hash_3 (input): key => [value_hash_3]
hash_4 (output): key => [value_hash_1, value_hash_2, value_hash_3]
by Eily (Monsignor) on Dec 10, 2018 at 23:57 UTC
I don't know how time-sensitive your project is, but even if you can afford to see your program run for 10 minutes, it's always nice to have something faster than that :D. You can actually speed things up a lot by *not* building the three hashes. johngg had a good point when he said that you should base your search on the smallest hash. Going even further, you only need to build that one. This would give something like (that's pseudo-perl, but you should get the idea):
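Eily's pseudo-perl was not preserved in this copy of the thread; a runnable sketch of the idea, with small in-memory strings standing in for the three CSV files, could look like this:

```perl
use strict;
use warnings;

# Stand-ins for the three CSV files (one key,value pair per line).
my $csv1 = "a,1\nb,2\nc,3\n";
my $csv2 = "a,10\nc,30\nd,40\n";
my $csv3 = "a,100\nb,200\nc,300\n";

# Build a hash from the smallest file only.
my %common;
open my $fh1, '<', \$csv1 or die $!;
while (<$fh1>) {
    chomp;
    my ( $key, $value ) = split /,/;
    $common{$key} = [$value];
}

# Stream the other two files, appending values for known keys only;
# keys not already in %common are simply skipped.
for my $csv ( \$csv2, \$csv3 ) {
    open my $fh, '<', $csv or die $!;
    while (<$fh>) {
        chomp;
        my ( $key, $value ) = split /,/;
        push @{ $common{$key} }, $value if exists $common{$key};
    }
}

# Drop any key that was not seen in all three files.
delete $common{$_} for grep { @{ $common{$_} } != 3 } keys %common;
```

Only one hash is ever held in memory, which matters at the 14-million-key scale the OP describes.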
You should also consider using Text::CSV if you're not already doing so :) Edit: oh, and I showed the wrong example, but you should avoid variable names that are identical except for a number as much as possible. So maybe rename %h1 to %ref, if you don't already have better (distinct) names available for your hashes.
by AnomalousMonk (Archbishop) on Dec 10, 2018 at 22:07 UTC
"I then ingest the three large .csv files put them into the three hash tables (takes about 6 minutes) ..."

If you can fit your three hash tables into memory (and the quoted statement says you're doing just that), then I don't see why Eily's approach here (and johngg's identical approach) would present any implementation problem. The only stumbling block I can see is that the output hash might not also fit into memory. In that case, common key/value-set records could be written out to a file, perhaps a .csv file, for later processing.

If the problem you're facing is that the input data is so large that even one input hash won't fit into memory, then an approach of doing an "external" sort of each input file and then merging the sorted input files with a Perl script suggests itself. This approach scales well to input files of enormous size, far larger than anything that could be accommodated in system RAM, and to large numbers of input files, and can still be executed in minutes (in most cases) rather than hours.

"hash_1 (input)"

If this is the structure of your input hashes, I don't see why you're going to the extra (and superfluous) effort of putting each value into an anonymous array, and then taking it out again when you create the output hash value. In what you've shown us so far, each input hash key has only a single value; why stick it into an array?

Give a man a fish: <%-{-{-{-<
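The sort-and-merge idea can be sketched roughly as follows. This is a toy illustration, not AnomalousMonk's code; it assumes each input is a CSV of key,value lines already sorted by key (in practice by an external sort such as the system sort utility), with in-memory strings standing in for the files:

```perl
use strict;
use warnings;

# Three inputs already sorted by key (simulated with in-memory strings;
# in practice these would be the externally sorted CSV files on disk).
my @sorted = (
    "a,1\nb,2\nc,3\n",
    "a,10\nc,30\nd,40\n",
    "a,100\nb,200\nc,300\n",
);
my @fh = map { open my $h, '<', \$_ or die $!; $h } @sorted;

# Read one [key, value] record from a stream, or undef at EOF.
sub next_rec {
    my ($fh) = @_;
    defined( my $line = <$fh> ) or return undef;
    chomp $line;
    return [ split /,/, $line, 2 ];
}

my @rec = map { next_rec($_) } @fh;
my @out;    # records whose key appears in all three inputs

while ( !grep { !defined } @rec ) {
    # Find the largest of the three current keys.
    my ($max) = sort { $b cmp $a } map { $_->[0] } @rec;
    if ( !grep { $_->[0] ne $max } @rec ) {
        # All three streams agree: emit key plus the three values.
        push @out, join ',', $max, map { $_->[1] } @rec;
        @rec = map { next_rec($_) } @fh;
    }
    else {
        # Advance any stream whose key is behind the largest key.
        for my $i ( 0 .. 2 ) {
            $rec[$i] = next_rec( $fh[$i] ) if $rec[$i][0] lt $max;
        }
    }
}
print "$_\n" for @out;    # prints a,1,10,100 then c,3,30,300
```

Memory use is constant regardless of input size, since only one record per stream is held at a time.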
Re: multiple hash compare, find, create
by kcott (Archbishop) on Dec 11, 2018 at 07:54 UTC
Reading through the various bits of information posted here, it seems your current process is: pull the raw data from Splunk in three separate searches, parse the raw flat files into three .csv files, read those .csv files into three hashes, and finally combine them into a fourth hash. That would seem to be a lot of unnecessary work. Do you have some additional use for the intermediary CSV files? Do you have some additional use for the intermediary hashes? I don't know what your initial raw data looks like. I dummied up 3 files (pm_1227048_raw_1, pm_1227048_raw_2 & pm_1227048_raw_3) from the data posted in "Re^2: multiple hash compare, find, create". Each contains an exact copy of what's posted there; for example,
Unfortunately, while some keys had 2 associated values, none had 3 values. I've added an extra step, in the script below, to show the technique. If your real data has keys with 3 values, you can dispense with that extra step. The comments should make this clear. Here's the example script to show the technique. Each raw data file is parsed once to create one hash. When all data has been collected, one delete statement removes the unwanted key-value pairs.
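kcott's full script was not preserved in this copy of the thread; a cut-down sketch of the technique he describes (parse each file once into a single hash, then one delete statement) might look like this, with toy in-memory data in place of his dummy files:

```perl
use strict;
use warnings;

# Simulated raw data files: one key,value pair per line.
my @raw = (
    "a,1\nb,2\nc,3\n",
    "a,10\nc,30\nd,40\n",
    "a,100\nb,200\nc,300\n",
);

# One hash collects every value seen for every key across all files.
my %data;
for my $raw (@raw) {
    open my $fh, '<', \$raw or die $!;
    while (<$fh>) {
        chomp;
        my ( $key, $value ) = split /,/;
        push @{ $data{$key} }, $value;
    }
}

# One delete statement (on a hash slice) removes every key that did
# not appear in all three files.
delete @data{ grep { @{ $data{$_} } != 3 } keys %data };
```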
Output:
— Ken
Re: multiple hash compare, find, create
by stevieb (Canon) on Dec 10, 2018 at 18:37 UTC
Here's one way, although it does iterate over all three hashes twice. My brain hasn't quite kicked in yet this Monday morning, so I'm likely seriously overlooking a more efficient way to do it: First, create a new temporary hash that takes the count of keys; then, using that new hash with the appropriate keys (the ones that have been counted at least three times), append the values from all three hashes into the new %combined hash:
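The code stevieb posted is not preserved in this copy; a sketch consistent with his description (toy hashes %h1..%h3 are placeholders, not his data) could be:

```perl
use strict;
use warnings;

my %h1 = ( a => 1,   b => 2,   c => 3 );
my %h2 = ( a => 10,  c => 30,  d => 40 );
my %h3 = ( a => 100, b => 200, c => 300 );

# First pass: count how many of the three hashes each key appears in.
my %count;
$count{$_}++ for keys %h1, keys %h2, keys %h3;

# Second pass: for keys seen in all three, gather the three values.
my %combined;
for my $key ( grep { $count{$_} == 3 } keys %count ) {
    $combined{$key} = [ $h1{$key}, $h2{$key}, $h3{$key} ];
}
```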
Output:
by Eily (Monsignor) on Dec 10, 2018 at 18:54 UTC
Iterating through the first hash and looking for keys that exist in the two others is enough. But seeing the size of the input hashes, I'd say it might be better to do that while constructing the hashes (e.g., first build %h1, then iterate through the values of %h2 but only insert the keys that already exist in %h1, and then do the same for %h3), if they aren't tied hashes in the first place.
by stevieb (Canon) on Dec 10, 2018 at 19:14 UTC
D'oh! I don't know how I didn't think of that! If the first hash's keys don't exist in hash2 or hash3, then the key doesn't exist in all three, does it. Likewise, if hash2 or hash3 have keys hash1 doesn't, same. ;) Sheesh, leave me alone and go home, Monday... OP... this answer by Eily is by far a better, more efficient and more elegant approach than what I presented.
by supertaco (Novice) on Dec 10, 2018 at 18:51 UTC
Re: multiple hash compare, find, create
by supertaco (Novice) on Dec 11, 2018 at 13:33 UTC
Hello Perl Monks. Thanks for the help yesterday solving my issue. I implemented Cristoforo's solution and it worked like a charm. However, as I have time, I will work the other solutions into my code; I'm learning a lot from everyone here and I greatly appreciate your assistance. Thanks again, and I hope to contribute to the knowledge base in the near future. BTW, the entire script processes three hash tables, each with between 14 million and 28 million k/v pairs, and outputs a fourth hash table in the same size realm, in about 20 minutes. Perl Monks rock!
Re: multiple hash compare, find, create
by Cristoforo (Curate) on Dec 10, 2018 at 18:57 UTC
Yes, Eily had the most efficient solution. There's no need to create the 3 sets :-(
by supertaco (Novice) on Dec 10, 2018 at 19:38 UTC
Hi Cristoforo. I'm trying it anyway :) I appreciate your help! Will look at Eily's solution, too.
by supertaco (Novice) on Dec 11, 2018 at 13:26 UTC
Hey Cristoforo, I implemented your solution (without the strikethroughs :)) and it worked like a dream and creates some flexibility for future extensions. I will look into implementing the other solution as I have time. But thanks again!
Re: multiple hash compare, find, create
by LanX (Saint) on Dec 10, 2018 at 23:28 UTC
The fastest way to intersect two hashes is hash slices. See Using hashes for set operations... 28 million keys sounds weird; this should lead to ~2.8 GB of memory consumption. Reorganizing such hashes into nested hashes is recommended if you can bundle accesses to the same sub-hash. See Re: write hash to disk after memory limit
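No code was attached to this copy of the post; the slice idiom LanX points at looks roughly like this sketch (toy hashes are placeholders), where one slice assignment builds the whole result hash at once:

```perl
use strict;
use warnings;

my %h1 = ( a => 1,   b => 2,   c => 3 );
my %h2 = ( a => 10,  c => 30,  d => 40 );
my %h3 = ( a => 100, b => 200, c => 300 );

# Keys present in all three hashes.
my @common = grep { exists $h2{$_} && exists $h3{$_} } keys %h1;

# A single hash-slice assignment populates the result in one statement.
my %h4;
@h4{@common} = map { [ $h1{$_}, $h2{$_}, $h3{$_} ] } @common;
```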
Cheers Rolf
Re: multiple hash compare, find, create
by thanos1983 (Parson) on Dec 10, 2018 at 22:13 UTC
Hello Anonymous Monk, I see the monks have replied to your question. I saw the question 2-3 hours ago and started working on it, but I got busy with something else work-related. Anyway, just for fun (it took so much time to prepare the hashes :P). Read more... (12 kB)
by AnomalousMonk (Archbishop) on Dec 10, 2018 at 23:20 UTC
Why is the key "a6fbb013-b75f-4dd7-9d1a-24f566020042" present in the output hash when it seems to appear only in the %hash_1 and %hash_3 input hashes and not in the %hash_2 input hash? Why is this key present in the value-set array for this key in the output hash? (Indeed, every key in the output hash seems to be present in its associated value-set array.) This does not seem to comport with the output that supertaco wants.

"I would follow the same approach as fellow Monk stevieb."

But stevieb has endorsed Eily's approach.

Give a man a fish: <%-{-{-{-<
by thanos1983 (Parson) on Dec 11, 2018 at 11:31 UTC
Hello AnomalousMonk, You are right. After a bit of experimentation with my solution, I can see that it is wrong. The best approach that I could think of is similar to fellow Monk Eily's.
Though Eily's approach is better: using unless, it checks whether the key exists in the hash and, if not, skips it, so there's no need to proceed and waste resources. Thanks for spending the time to check and correct me :)
Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: multiple hash compare, find, create
by Anonymous Monk on Dec 10, 2018 at 17:17 UTC
by supertaco (Novice) on Dec 10, 2018 at 17:46 UTC
Hi there and thanks!
hash_1
hash_2
hash_3
desired output (just blur your eyes and assume each key has the correct values. I faked the output.)