Your real problem here is that you are asking compute nodes to do database I/O, which utterly and completely defeats the purpose of having them.
If you have "400 compute blades" and "398 of them are waiting for I/O," you have accomplished zero. Parallel compute servers must be handed all of the data that they require up front, so that they never need to "wait" for anything. Ever. Unless every one of them is straining the heat-dissipating capacity of your hardware to its utmost, they are not doing their job.
In your present design, the "ruling constraint" is the capacity of the database server, which leaves nearly all of your parallel silicon sitting idle.
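
To make the alternative concrete, here is a minimal sketch of the shape I mean, not your actual system: one coordinator does a single bulk read, slices the result into self-contained chunks, and hands each chunk to a worker that only computes. Everything named here is an assumption for illustration: the `samples` table, the in-memory SQLite database standing in for your real server, the sum-of-squares "work" in `crunch`, and the thread pool standing in for your blades (on real hardware that would be a process pool or your cluster's job scheduler).

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def make_demo_db(n):
    """Hypothetical stand-in for the real database: n integer rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, value INTEGER)")
    conn.executemany("INSERT INTO samples (value) VALUES (?)",
                     [(i,) for i in range(n)])
    return conn


def load_all(conn):
    """One bulk read by the coordinator; workers never touch the connection."""
    return [row[0] for row in conn.execute("SELECT value FROM samples ORDER BY id")]


def chunked(items, size):
    """Split the preloaded data into self-contained work units."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def crunch(chunk):
    """Pure computation on data already in memory: no I/O inside the worker."""
    return sum(v * v for v in chunk)


def run(conn, executor, chunk_size):
    """Load once, then fan the chunks out to whatever executor is supplied."""
    data = load_all(conn)
    return sum(executor.map(crunch, chunked(data, chunk_size)))


if __name__ == "__main__":
    # Threads are only a placeholder for the compute nodes in this sketch.
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(run(make_demo_db(1000), pool, chunk_size=100))
```

The point of the design is that `crunch` receives everything it needs as an argument, so the database is hit exactly once, by one process, before any worker starts.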