I learned that the hard way the other day. Since I don't need exactly identical results from my tests, I now normalize the outputs and compare their character counts to make sure they're not too far off from each other. For what I'm doing, and at this stage of my process, that's perfectly fine.
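A check along these lines could be sketched as follows. This is only an illustration in Python, which may not be the language the benchmarks are written in. The function names `normalize` and `roughly_same`, the whitespace-collapsing rule, and the 5% tolerance are all assumptions made for the example.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace to one space and trim the ends.

    Assumption: whitespace is the only formatting difference worth ignoring.
    """
    return re.sub(r"\s+", " ", text).strip()


def roughly_same(a: str, b: str, tolerance: float = 0.05) -> bool:
    """Return True if the normalized lengths differ by at most `tolerance`.

    The difference is measured relative to the longer of the two outputs.
    The 5% default is an arbitrary, hypothetical threshold.
    """
    len_a, len_b = len(normalize(a)), len(normalize(b))
    longest = max(len_a, len_b)
    if longest == 0:
        # Two empty outputs count as matching.
        return True
    return abs(len_a - len_b) / longest <= tolerance
```

Something like `roughly_same(output_of_old_version, output_of_new_version)` would then act as a cheap sanity check before trusting the timings. It only compares lengths, so it will not catch two outputs of similar size with different content.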
With the constant back and forth between editing code and running a benchmark, I'm also finding that I want a few more features in the Benchmark module. I wrote some code that makes the process faster and easier. At the end of this project, I might clean it up and post it here.