... the problem is only seen under heavy production load
Have you checked the memory consumption (either on a single instance of the script, or summed over multiple instances that might be running simultaneously on a given server)?
If any single instance takes up a significant amount of memory, then you can probably work out how many concurrent jobs it would take to swamp available RAM and push one or more of the jobs into severe swapping / page faulting. The standard Unix/GNU "top" command might suffice to spot a problem of that sort as it's happening.
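As a quick check, something like the following sketch sums the resident set size (RSS) over every running instance of the script. The name "myscript.pl" is a hypothetical placeholder; substitute whatever your process actually shows up as in the process list.

```shell
#!/bin/sh
# Sum resident memory (RSS) across all running instances of a script.
# "myscript.pl" is a hypothetical name -- substitute your own.
NAME="myscript.pl"
total=0
count=0
for pid in $(pgrep -f "$NAME"); do
    rss=$(ps -o rss= -p "$pid")    # resident set size in KB
    [ -n "$rss" ] || continue      # process may have exited meanwhile
    total=$((total + rss))
    count=$((count + 1))
done
echo "$count instance(s), $((total / 1024)) MB total RSS"
```

Dividing available RAM by the per-instance figure gives a rough upper bound on how many jobs can run concurrently before swapping starts.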
How hard will it be to reduce the memory footprint of your script? Alternatively, how bad will it be to limit the number of simultaneous jobs? The thing about page-fault delays is that the timing impact is non-linear: making one job wait a few seconds before it really starts, so that it is serialized relative to some other job (it won't start until that job finishes), can often lead to faster overall completion than letting it run immediately and simultaneously, causing unsustainable competition for available resources.
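One cheap way to get that serialization is a lock that each job must acquire before starting, so overlapping runs queue up instead of competing for RAM. Here is a minimal, portable sketch using mkdir as the lock primitive (mkdir is atomic on POSIX filesystems); the lock path and the job body are hypothetical placeholders.

```shell
#!/bin/sh
# Sketch: serialize jobs with a mkdir-based lock so two memory-hungry
# runs never overlap. LOCKDIR and the job body are placeholders.
LOCKDIR="${TMPDIR:-/tmp}/myscript.lock"
tries=0
while ! mkdir "$LOCKDIR" 2>/dev/null; do
    tries=$((tries + 1))
    if [ "$tries" -ge 300 ]; then
        echo "gave up waiting for lock after 300s" >&2
        exit 1
    fi
    sleep 1    # another instance holds the lock; wait our turn
done
trap 'rmdir "$LOCKDIR"' EXIT   # release the lock on any exit
echo "job running exclusively"
# ... the real memory-hungry work goes here ...
```

On Linux, the util-linux flock(1) wrapper does the same thing with less ceremony (e.g. `flock /tmp/myscript.lock perl myscript.pl`), but the mkdir approach works on any POSIX shell.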