With testb (your approach), the semaphore is up'ed
No, with my approach, there is no semaphore.
Create N threads. Have them read work items from an input queue. Create an output thread that reads items from an output queue. Now feed input to the input queue. Create a work item by sharing a new HASH or ARRAY ref. Store a sequence number into the work item. Store what the worker needs to know into the work item. Add the work item to the input queue. When a worker thread pulls a work item, it replaces what only it needs to know with what it computes and then puts the result (still containing a sequence number) into the output queue.
The output thread just pulls stuff out of the output queue. It starts off knowing nothing other than the first sequence number. When it pulls a work item, if that work item's sequence number matches the next sequence number, then it can output it immediately. If not, it stores it into a not-shared hash with the sequence number as the key. Every time it outputs something it also increments the sequence number and looks the result up in its local hash. If that finds a match, then it deletes it from the hash and outputs it (which triggers a repeat of this process).
If I were writing it, I'd probably at least look at having the main thread handle both input and output. Then it could track (with a simple non-shared counter) how many work items are potentially in the combined queues and avoid letting that build up too high so that the sources of inputs can notice that intake is falling behind instead of just having RAM usage grow unbounded with no other signs of problems.