With testa (my approach), it only blocks between threads if the next thread due to output is not the first one to finish. As soon as that thread finishes, a new thread is created for the next chunk to be processed. This could probably be improved further if I up the semaphore at the end of the worker thread rather than after the next thread to output is joined.
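To make the testa scheduling concrete, here is a minimal sketch of the idea in Python rather than the Perl from this thread. It is an analogue, not the actual testa code: `process`, `run_ordered`, and `max_workers` are hypothetical names, and the per-chunk work is a stand-in.

```python
import threading
from collections import deque

def process(chunk):
    # Hypothetical stand-in for the real per-chunk work.
    return chunk.upper()

def run_ordered(chunks, max_workers=4):
    # Assumes max_workers >= 1.
    slots = threading.Semaphore(max_workers)
    pending = deque()   # (thread, result holder) pairs in submission order
    output = []

    def worker(chunk, holder):
        holder.append(process(chunk))

    def drain_oldest():
        thread, holder = pending.popleft()
        thread.join()          # blocks only if the next chunk to output is still running
        output.append(holder[0])
        slots.release()        # the slot is freed after the in-order join

    for chunk in chunks:
        # If every slot is taken, wait for the next chunk in output order.
        while not slots.acquire(blocking=False):
            drain_oldest()
        holder = []
        thread = threading.Thread(target=worker, args=(chunk, holder))
        thread.start()
        pending.append((thread, holder))

    while pending:
        drain_oldest()
    return output
```

Because the slot is released only after the oldest thread is joined, a slow chunk at the head of the line can hold up new work even when later chunks have already finished. That is the stall the improvement mentioned above would try to remove.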
With testb (your approach), the semaphore is upped as soon as any chunk of data finishes processing, allowing another chunk to be queued and processed. That should theoretically be more efficient, but it isn't: it is 5x slower. Even with ikegami's clean and elegant solutions using your approach, it is still 5x slower. I suspect it has to do with how memory is managed when passing data structures to threads, as opposed to making them shared in a queue.
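For comparison, the testb scheduling can be sketched the same way. Again this is a Python analogue under assumptions, not the Perl code being benchmarked: `run_queued`, `max_in_flight`, and the reordering buffer are hypothetical, and Python's threading says nothing about the memory behaviour of Perl threads that I suspect is behind the slowdown.

```python
import queue
import threading

def process(chunk):
    # Hypothetical stand-in for the real per-chunk work.
    return chunk[::-1]

def run_queued(chunks, max_in_flight=4):
    # chunks is assumed to be a list, since len() is used below.
    slots = threading.Semaphore(max_in_flight)
    results = queue.Queue()   # shared queue of (index, result) pairs

    def worker(index, chunk):
        results.put((index, process(chunk)))
        slots.release()       # freed as soon as this chunk is done, in any order

    def producer():
        for index, chunk in enumerate(chunks):
            slots.acquire()   # wait for any free slot, not for a particular thread
            threading.Thread(target=worker, args=(index, chunk)).start()

    feeder = threading.Thread(target=producer)
    feeder.start()

    output, buffered, next_index = [], {}, 0
    for _ in range(len(chunks)):
        index, value = results.get()
        buffered[index] = value
        # Emit every result that is now contiguous with what was already output.
        while next_index in buffered:
            output.append(buffered.pop(next_index))
            next_index += 1
    feeder.join()
    return output
```

Here a slow chunk never stops new chunks from starting, but finished results have to wait in the buffer until it is their turn to be output, so the buffer can grow while the head chunk is still running.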