in reply to Re^2: Small Hash a Gateway to Large Hash?
in thread Small Hash a Gateway to Large Hash?

You already know that hash lookups are O(1): lookup cost stays essentially constant no matter how large the hash grows. Collisions aren't a big deal; even for gigantic hashes, the bucket chains stay short. Rehashing is a more significant cost, but it only happens while the hash is growing, not once it has reached its final size.
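One standard mitigation for the rehashing cost (not something from this thread, just a documented Perl feature): if you know roughly how many keys the hash will hold, assigning to keys() presizes it, so the buckets get allocated up front. A minimal sketch, with a made-up key count:

    # Presizing a hash: assigning to keys() is a documented hint that
    # tells Perl to allocate buckets up front (rounded to a power of 2).
    # The 10-million figure is purely illustrative.
    my %big_hash;
    keys(%big_hash) = 10_000_000;

    # ... populate %big_hash as usual; with enough buckets preallocated,
    # the build no longer triggers incremental rehashing ...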

If we knew more about the nature of your "big hash" data set, we might find ways of pre-processing or restructuring it for faster searches. But in the general case, it's hard to beat what you've got already.

One generalized solution would be to use workers: build your hash first, and then fork worker processes -- possibly one or two per processor core. As long as they never write to the original data, the "copy on write" mechanism won't be engaged, and they will share the data set. Let each worker keep track of its own tally, and when they're done, have them report back to the parent script, which can reduce the workers' partial results down to a single final answer.
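Here's a minimal sketch of that pattern, assuming a hypothetical %big_hash and a simple "sum the values" tally standing in for the real per-worker computation; the worker count and the data are made up for illustration:

    use strict;
    use warnings;

    # Stand-in data: in practice this would be the real big hash,
    # built once in the parent before any fork.
    my %big_hash = map { $_ => $_ * 2 } 1 .. 1_000_000;
    my @keys     = keys %big_hash;
    my $WORKERS  = 4;    # e.g. one or two per core

    my @readers;
    for my $w ( 0 .. $WORKERS - 1 ) {
        pipe( my $reader, my $writer ) or die "pipe: $!";
        defined( my $pid = fork() ) or die "fork: $!";
        if ( $pid == 0 ) {    # child: read-only, so pages stay shared (COW)
            close $reader;
            my $tally = 0;
            for ( my $i = $w; $i < @keys; $i += $WORKERS ) {
                $tally += $big_hash{ $keys[$i] };  # never write to %big_hash
            }
            print {$writer} "$tally\n";            # report back to the parent
            close $writer;
            exit 0;
        }
        close $writer;
        push @readers, $reader;
    }

    # Parent: reduce each worker's partial tally to one final answer.
    my $total = 0;
    for my $r (@readers) {
        chomp( my $partial = <$r> );
        $total += $partial;
        close $r;
    }
    wait() for 1 .. $WORKERS;
    print "Total: $total\n";

The pipes are just one way to get results back; a module like Parallel::ForkManager would tidy up the process management, but the plain-fork version makes the copy-on-write sharing explicit.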

There may be less general but more efficient solutions for your specific data set, but we would need a better understanding of that data set before we could rule such strategies in or out.


Dave
