|Don't ask to ask, just ask|
You say that Disks are slow. But network IO is slower!
Your solution is to keep all the data in memory by distributing it across multiple machines, but for 130,000 bibliographies, a (these days quite modest) 2GB machine could handle 16KB per record. Even with structural overheads that is a fairly expansive bibliography. By moving to even the most basic 64-bit OS on commodity hardware, that can limit can be at least quadrupled.
Whilst you gain through paralellism of the multiple boxes, you loose through networking (IO and coordination). You say "This is where merge to central Redis from nodes comes into play, removing need for merge step all-together at expense of network traffic.", but one way or another, the results from multiple machines has to end up being gathered back together. And that means it has to transition the network and be written to disk.
These days, even commodity boxes come with 2 or 4 cores, with 6 or 8 becoming the norm at the next inventory upgrade cycle. If you applied the same 'remote processes' solution you are using with networked boxes within a fully-RAMed, multi-core box, the 'networking' overhead becomes little more than serialisation through shared memory buffers (plus some protocol overhead that you'd have anyway).
If you use threads instead of processes, each core can process over a segment of shared memory, avoiding all the networking overhead entirely. And by returning references to the shared memory rather than data, to the merging thread, all network IO is avoided completely.
And for scalability; with 64-bit OSs capable of handling at least 128GB of virtual memory, you'd have plenty of room for growth. Whilst random access to virtual memory greater than physical memory will 'thrash the cache', there are still huge gains to be had by avoiding the networking protocal overhead.
Please don't take this as a critisism of your sharing your current solution. That's all good. I just can't help think ing that for problems of this scale, and those an order of magnitude or two larger, it is a solution with a quite limited lifespan.
For a while now I've been looking at a solution of using virtualised memory (memory mapped files) to provide generalised, thread-shared, random read-write access to large datasets (up to ~128GB), and your problem is the first real-world task that has come up here. If you could share your dataset, or at least a reasonably detailed schema sufficient to allow the creation of a realistic testset to be generated, it would be extremely interesting (to me) to see how the two approaches to the problem compare. In terms of both complexity and performance.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.