http://qs321.pair.com?node_id=1104276


in reply to Re^4: Calculating corruption
in thread Calculating corruption

So with this clearer statement of your actual problem, we can see there's a statistical method you can use to determine whether a collection of bytes is less random than expected. And I may be better able to help you nail down the simplest method than a smarter person would be, precisely because I don't know mathematics or statistics very well.

In an encrypted file, each of the 256 possible byte values, 0 through 255, will occur about the same number of times. They won't occur exactly the same number of times, of course, but their frequencies will mostly be very close. (This is one of your stated assumptions.) You can easily measure the maximum variance from the mean frequency in one or more example encrypted files. I remember learning the word "epsilon" a few years ago, and I think it applies here: you compute a useful epsilon, a tolerance, and use it to decide whether any byte value in an encrypted file occurs more or less frequently than expected. Wild outliers imply corruption.
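Here is a minimal sketch of that idea in Python, just to make the steps concrete. The function names, the 1.5 safety factor, and the use of `os.urandom` as a stand-in for a known-good encrypted file are all my own assumptions, not anything from your setup.

```python
import os
from collections import Counter


def byte_counts(data):
    """Return a list of 256 counts, one per byte value 0..255."""
    counts = Counter(data)
    return [counts.get(value, 0) for value in range(256)]


def max_relative_deviation(data):
    """Largest |count - mean| / mean over all 256 byte values."""
    if not data:
        raise ValueError("need at least one byte")
    mean = len(data) / 256
    return max(abs(count - mean) / mean for count in byte_counts(data))


def looks_corrupt(data, epsilon):
    """True if some byte value strays further from the mean than epsilon."""
    return max_relative_deviation(data) > epsilon


if __name__ == "__main__":
    # Stand-in for a known-good encrypted file (assumption: random bytes
    # resemble good ciphertext closely enough for calibration).
    good = os.urandom(1 << 20)
    # Hypothetical safety margin: allow 1.5x the worst outlier seen.
    epsilon = 1.5 * max_relative_deviation(good)
    # Simulate damage by overwriting the second half with zero bytes.
    damaged = good[: 1 << 19] + bytes(1 << 19)
    print("epsilon:", epsilon)
    print("good flagged:", looks_corrupt(good, epsilon))
    print("damaged flagged:", looks_corrupt(damaged, epsilon))
```

In practice you would calibrate epsilon on several real encrypted files rather than on one random sample, and this only catches damage that skews the byte frequencies.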

I used the word "variance" above. I think standard deviation is a measure of statistical spread (it's the square root of the variance). (I'm not going to google it now. I'm winging this explanation on intuition and poor memory.) I think of the epsilon I described above as the greatest percentage difference from the mean shown by the furthest outlier in a viable encrypted file. I don't know enough about standard deviation to know if it has anything to do with my naïve conception of "percentage difference from the mean." But I suspect it does.
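For what it's worth, the two notions can be tied together under one assumption: if every byte is uniformly random and independent, the count of any single byte value in an n-byte file follows a binomial distribution with p = 1/256. Its mean is n/256 and its standard deviation is sqrt(n * p * (1 - p)). A rough sketch that converts a "percentage difference from the mean" into a number of standard deviations (the function names are hypothetical):

```python
import math


def expected_sigma(n_bytes):
    """Standard deviation of one byte value's count in n_bytes of
    uniformly random data (binomial with p = 1/256)."""
    p = 1 / 256
    return math.sqrt(n_bytes * p * (1 - p))


def relative_to_sigmas(relative_deviation, n_bytes):
    """Convert a fractional deviation from the mean count (0.1 means 10%)
    into a number of standard deviations for a file of n_bytes."""
    mean = n_bytes / 256
    return relative_deviation * mean / expected_sigma(n_bytes)


if __name__ == "__main__":
    for size in (64 * 1024, 64 * 1024 * 1024):
        print(size, "bytes: 10% off the mean is",
              relative_to_sigmas(0.10, size), "standard deviations")
```

The point of the conversion is that the same percentage means very different things at different file sizes: a 10% outlier is unremarkable in a 64 KiB file but wildly improbable in a 64 MiB one. So a percentage-style epsilon should be calibrated on files of about the same size as the ones being checked.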