Oh, so it is "slower" because your threads spend their time blocking on each other, not because it is inherently more expensive. Don't do that.
If you need to preserve order, then create one slot per piece of data and let the threads fill in the slots at their leisure. Way, way simpler code. I could not work out what the point of your overly complicated sample code was. Sorry.
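Here is roughly what I mean by slots, as a minimal sketch in Python rather than your language. The names `process` and `run_in_slots` are made up for illustration, and `process` is just a stand-in for whatever your real per-item work is. Each thread writes only to its own index, so no thread ever waits on another, and the output order matches the input order no matter which thread finishes first.

```python
import threading

def process(item):
    # Hypothetical stand-in for the real per-item work.
    return item * item

def run_in_slots(items):
    # One slot per input; slot i is written only by the thread handling items[i].
    results = [None] * len(items)

    def worker(i, item):
        # No lock needed: every thread writes to a distinct index.
        results[i] = process(item)

    threads = [threading.Thread(target=worker, args=(i, item))
               for i, item in enumerate(items)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # the only waiting is the final join
    return results
```

Calling `run_in_slots([3, 1, 2])` gives the squares back in input order. One thread per item is only for brevity here; with a lot of items you would want a fixed number of workers instead.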
Don't pass thread handles through queues. Pass work items. If you only have as many threads as you need, then you don't need the semaphore at all.
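Something like the following is what I have in mind for passing work items, again as a rough Python sketch and not a drop-in replacement for your code. `handle`, `run_pool` and `num_workers` are names I invented, and `handle` stands in for your real work. The queue carries `(index, item)` pairs, a fixed set of workers pulls from it, and a `None` per worker tells them to quit. Since the number of threads is fixed up front, nothing needs a semaphore to throttle them.

```python
import queue
import threading

def handle(item):
    # Hypothetical stand-in for the real per-item work.
    return item.upper()

def run_pool(items, num_workers=4):
    work = queue.Queue()              # carries work items, never thread handles
    results = [None] * len(items)     # one slot per item keeps the output ordered

    def worker():
        while True:
            job = work.get()
            if job is None:           # sentinel: no more work for this worker
                return
            i, item = job
            results[i] = handle(item)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for job in enumerate(items):      # each job is an (index, item) pair
        work.put(job)
    for _ in threads:                 # one sentinel per worker
        work.put(None)
    for t in threads:
        t.join()
    return results
```

With this shape the only place a worker blocks is `work.get()` when there is nothing left to do, which is exactly when you want it idle.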