Firing the query generally takes much less time than the comparison, and this is the primary reason why I want to segregate things this way.
Actually, that's a very strong argument for not splitting your workers. Let's assume that a query takes 1/10th the time of a comparison. If you have 10 Queriers, then you would need at least 100 Comparers to keep up, and that's if the communications and data transfers between them took no time at all, which isn't the case. So now you have 100 Comparers, each holding 10k of data. And transferring data from one place to another involves duplication, which can double or treble the memory consumption.
And if the Comparers are slower than the Queriers, then the latter are going to run away, stacking up work for the former and filling memory. So now you have to consider adding semaphores to interlock the queues and prevent runaway and memory meltdown. And that adds complexity, with the need for synchronisation, the risk of deadlocking, and all that nasty stuff.
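To make those extra moving parts concrete, here is a minimal sketch of the split design in Python. The run_query and compare functions are hypothetical stand-ins for your real query and comparison code. Notice everything it needs that the simple design doesn't: two pool sizes to balance, a bounded queue to provide the backpressure that stops the Queriers running away, and shutdown sentinels that are only safe to post after every Querier has finished:

```python
import threading
import queue

def run_query(item):
    """Placeholder for your real query; returns the data to compare."""
    return item

def compare(item, data):
    """Placeholder for your real comparison."""
    pass

results = queue.Queue(maxsize=100)  # bounded: Queriers block if Comparers lag

def querier(items):
    for item in items:
        results.put((item, run_query(item)))  # blocks when the queue is full

def comparer():
    while True:
        entry = results.get()
        if entry is None:                     # sentinel: shut this Comparer down
            break
        item, data = entry
        compare(item, data)

def run_pipeline(all_items, n_queriers=2, n_comparers=4):
    chunks = [all_items[i::n_queriers] for i in range(n_queriers)]
    qs = [threading.Thread(target=querier, args=(c,)) for c in chunks]
    cs = [threading.Thread(target=comparer) for _ in range(n_comparers)]
    for t in qs + cs:
        t.start()
    for t in qs:
        t.join()             # must wait for ALL Queriers first...
    for _ in cs:
        results.put(None)    # ...or a sentinel could overtake real work
    for t in cs:
        t.join()
```

Get the sentinel ordering wrong, or size the bounded queue badly, and you have exactly the deadlock and runaway risks described above.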
On the other hand, if you stick with one type of worker that picks a work item, fires the query, retrieves the data, performs the comparison, cleans up, and goes back for the next work item, then you have a very straightforward linear flow. Things can never get out of sync. You can never get runaway. To do the work more quickly, you simply start more workers.
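By contrast, here is a minimal sketch of the simple model, again with hypothetical run_query and compare placeholders: one worker type, one queue, and the concurrency level is a single number.

```python
import threading
import queue

def run_query(item):
    """Placeholder: fire the query and return the retrieved data."""
    return item

def compare(item, data):
    """Placeholder: perform the comparison for one work item."""
    pass

work_items = queue.Queue()

def worker():
    # Each worker owns the whole cycle: query, retrieve, compare, clean up.
    while True:
        item = work_items.get()
        if item is None:          # sentinel: no more work for this worker
            break
        data = run_query(item)
        compare(item, data)
        del data                  # clean up before the next cycle

def run_pool(items, n_workers):
    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for item in items:
        work_items.put(item)
    for _ in threads:
        work_items.put(None)      # one sentinel per worker
    for t in threads:
        t.join()
```

Nothing can run away here: a worker cannot fetch a new item until it has finished, and cleaned up after, the previous one.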
There will be a limit to how many you can run concurrently, defined either by the memory available, the CPU power required for the processing, or the IO bandwidth. Which limit you will hit first will very much depend upon the details of the task and the hardware involved. But with the simple architecture, there is only one variable to adjust. I strongly recommend the simple approach.
Once you have that coded and running and can see how it performs, you will find out which limitation forms the boundary to scalability, and on the basis of that knowledge you can consider tweaking the architecture to address it. But even on the well-configured 12-CPU machine you described elsewhere, trying to guess up front whether your process will be CPU, memory, or IO bound at the limit is simply not possible.
I strongly advise sticking with the simple model. Make it work for 1 worker, and then 2. Once you're absolutely sure that it works correctly in both of those configurations, then start ramping up the number of workers. Start with 1 thread per CPU and see how that affects your throughput rate. Then try doubling the number and test again.
Note the throughput rate (from query issued to comparison completed and cleanup performed), the memory consumption, and the CPU load average at each concurrency level. After a few short tests, each of, say, 40 or 50 completed cycles, you should be able to plot a graph or two that will show you how the numbers vary with the number of threads. And that should allow you to plan the best production run strategy.
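A rough harness for that ramp-up might look like the following, assuming the run_pool function from the sketch above. Throughput is measured directly; memory consumption is easiest to watch with top or vmstat alongside, and os.getloadavg() (Unix-only) gives the load average:

```python
import os
import time

def benchmark(items, worker_counts=None):
    """Run a short batch at each concurrency level and report throughput."""
    if worker_counts is None:
        n = os.cpu_count() or 1
        worker_counts = [1, 2, n, 2 * n]   # 1, 2, one per CPU, then double
    for n_workers in worker_counts:
        start = time.monotonic()
        run_pool(items, n_workers)         # the simple pool sketched above
        elapsed = time.monotonic() - start
        load1, _, _ = os.getloadavg()      # Unix-only 1-minute load average
        print(f"{n_workers:3d} workers: {len(items) / elapsed:8.2f} cycles/sec, "
              f"load avg {load1:.2f}")
```

With 40 or 50 items per run, each test takes only minutes, and the printed numbers drop straight into a plot.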
But make it work for one thread and then two threads, and fully test its correctness in both scenarios first! I cannot emphasise enough the importance of making sure it works properly before you move into a performance testing phase.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.