in reply to Small Hash a Gateway to Large Hash?

What is your goal in moving away from the 3GB hash? Is it the obvious one, not wanting to hold so much memory hostage, or something less obvious, such as speed? Why would holding smaller hashes in memory be useful to you? Is it that you don't really need immediate random access to the entire 3GB hash at any given moment? What does the script actually do?

Would it be reasonable to build an SQLite table so that you're not holding it all in memory at once? SQLite is pretty fast. Or how about each hash element being represented, instead, by a MongoDB document?
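For what it's worth, here's a rough sketch of the SQLite route using DBI and DBD::SQLite. The file name, the table name, and the tab-separated input format are all made up for illustration; the point is just that the 3GB of key/value pairs live on disk, and lookups become prepared SELECTs instead of hash accesses.

    use strict;
    use warnings;
    use DBI;

    # Assumes DBD::SQLite is installed and data arrives as "key<TAB>value"
    # lines on STDIN; "bighash.db" and table "kv" are illustrative names.
    my $dbh = DBI->connect('dbi:SQLite:dbname=bighash.db', '', '',
                           { RaiseError => 1, AutoCommit => 0 });

    $dbh->do('CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)');

    # Load phase: each would-be hash entry becomes a row.
    my $ins = $dbh->prepare('INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)');
    while (my $line = <STDIN>) {
        chomp $line;
        my ($k, $v) = split /\t/, $line, 2;
        $ins->execute($k, $v);
    }
    $dbh->commit;   # one big transaction makes the bulk load far faster

    # Lookup phase: the moral equivalent of $hash{$key}, without 3GB in RAM.
    my $get = $dbh->prepare('SELECT v FROM kv WHERE k = ?');
    $get->execute('some_key');
    my ($val) = $get->fetchrow_array;
    print defined $val ? "$val\n" : "not found\n";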


Dave


Re^2: Small Hash a Gateway to Large Hash?
by Anonymous Monk on Feb 18, 2014 at 21:14 UTC
    Thanks Dave-

    Interesting suggestion- I've asked before about putting a relational db like Postgres on the machine, and was told "nyet". But SQLite is simply a Perl package (DBD::SQLite), and nobody needs to know it's almost an RDBMS.

    That said, the reason I'm looking into this at all is to make the script run faster. Without any hard data, I'm fairly confident this gigantic hash is a bottleneck.

      You already know that hash lookups are O(1): lookup time doesn't grow with the size of the hash. Collisions aren't that big a deal; the bucket chains, even in gigantic hashes, are kept pretty short. Rehashing is a more significant cost, but it only happens while the hash is growing, not once it has reached its finished state.
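      As an aside (not something you mentioned, just a standard Perl trick): if you know roughly how many keys the finished hash will hold, you can presize it by assigning to keys(), which allocates the bucket array up front so no rehashing happens during the build. The key count here is illustrative:

          my %h;
          keys(%h) = 3_000_000;   # pre-allocate buckets (Perl rounds up to a power of two)
          $h{"key$_"} = $_ for 1 .. 3_000_000;   # no bucket splits during this loop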

      Maybe if we knew more about the nature of your "big hash" data set we might find ways of pre-processing or re-structuring it for faster searches. But in the "general" case, it's hard to beat what you've got already.

      One generalized solution would be to use workers: build your hash first, and then fork worker processes -- possibly one or two per processor core. As long as the workers never write to the original data, the "copy on write" mechanism won't be engaged, and they will all share a single copy of the data set. Let each worker keep its own tally, and when they're done, have them report back to the parent script, which can reduce each worker's partial result down to a single final answer.
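      A minimal sketch of that approach, assuming a Unix-ish fork() with copy-on-write pages; the hash contents, the worker count, and the "work" (counting values that match a pattern) are all stand-ins:

          use strict;
          use warnings;

          my %big_hash = map { ("key$_" => "value$_") } 1 .. 100_000;  # stand-in for the real 3GB hash
          my @keys     = keys %big_hash;   # fixed key order so the work divides cleanly
          my $workers  = 4;                # e.g. one or two per core

          my @readers;
          for my $w (0 .. $workers - 1) {
              pipe(my $r, my $wr) or die "pipe: $!";
              defined(my $pid = fork) or die "fork: $!";
              if ($pid == 0) {             # child: reads only, so pages stay shared
                  close $r;
                  my $tally = 0;
                  for (my $i = $w; $i < @keys; $i += $workers) {
                      $tally++ if $big_hash{ $keys[$i] } =~ /42/;   # example "work"
                  }
                  print {$wr} "$tally\n";
                  exit 0;
              }
              close $wr;
              push @readers, $r;
          }

          # Parent: reduce each worker's partial tally to one final answer.
          my $total = 0;
          for my $r (@readers) {
              chomp(my $t = <$r>);
              $total += $t;
          }
          waitpid(-1, 0) for 1 .. $workers;   # reap the children
          print "total: $total\n";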

      There may exist less general but more efficient solutions for your specific data set, but we would need a better understanding of that data set before we could include or eliminate such strategies.


      Dave