http://qs321.pair.com?node_id=1104276


in reply to Re^4: Calculating corruption
in thread Calculating corruption

So with this clearer statement of your actual problem, we can see there's a statistical method you can use to determine if a collection of bytes is less random than expected. And I may be better able to help you nail down the simplest method than a smart person is precisely because I don't know mathematics or statistics very well.

In an encrypted file, each of the 256 possible byte values, 0 through 255, will occur about the same number of times. They won't occur the exact same number of times, of course, but they'll mostly be very close in frequency. (This is one of your stated assumptions.) You can easily measure the maximum variance from the mean of the frequencies of one or more example encrypted files. I remember learning the word "epsilon" a few years ago. I think it applies here. You compute a useful epsilon to use to determine if one or more byte values of an encrypted file occur more or less frequently than expected. Wild outliers imply corruption.
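To make the epsilon idea concrete, here is a minimal sketch in Python (my choice for illustration; nothing in this thread prescribes a language). The function names and the 1.5 safety margin are assumptions of mine, not anything from the original question. It counts how often each byte value occurs, finds the furthest outlier as a fraction of the mean count, and calibrates epsilon from files you already trust.

```python
from collections import Counter


def byte_counts(data: bytes) -> list[int]:
    """Return a 256-element list: how many times each byte value occurs."""
    counts = Counter(data)
    return [counts.get(value, 0) for value in range(256)]


def max_relative_deviation(data: bytes) -> float:
    """Distance of the furthest outlier from the mean count, as a fraction of the mean."""
    if not data:
        raise ValueError("cannot measure an empty file")
    mean = len(data) / 256  # expected count if every byte value were equally likely
    return max(abs(count - mean) for count in byte_counts(data)) / mean


def calibrate_epsilon(good_samples: list[bytes], margin: float = 1.5) -> float:
    """Hypothetical calibration: worst deviation seen in known-good files, times a margin."""
    return margin * max(max_relative_deviation(sample) for sample in good_samples)


def looks_corrupt(data: bytes, epsilon: float) -> bool:
    """Flag a file whose furthest outlier exceeds the calibrated epsilon."""
    return max_relative_deviation(data) > epsilon
```

Something like `looks_corrupt(open(path, "rb").read(), epsilon)` would then flag a suspect file. One caveat I am fairly sure of: the deviation you see depends on file size, so epsilon should be calibrated on good files of roughly the same size as the ones being tested.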

I used the word "variance" above. I think standard deviation is a measure of statistical variance. (I'm not going to google it now. I'm winging this explanation on intuition and poor memory.) I think of the epsilon I described above as being the result of computing the greatest percentage difference from the mean of the furthest outlier from the mean in a viable encrypted file. I don't know enough about standard deviation to know if it has anything to do with my naïve conception of "percentage difference from the mean." But I suspect it does.
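For what it's worth, the two ideas do connect. If the bytes really are uniformly random, each of the 256 counts follows a binomial distribution, so for a file of n bytes the expected standard deviation of a count is sqrt(n * (1/256) * (255/256)), which is roughly the square root of the mean count. Dividing an outlier's distance from the mean by that standard deviation gives a z-score, and the "percentage difference from the mean" is just that z-score times the standard deviation divided by the mean. A hedged sketch in Python follows; the function names are mine and purely illustrative.

```python
import math
import statistics
from collections import Counter


def byte_counts(data: bytes) -> list[int]:
    """Return a 256-element list: how many times each byte value occurs."""
    counts = Counter(data)
    return [counts.get(value, 0) for value in range(256)]


def expected_sd(n_bytes: int) -> float:
    """Standard deviation of one byte value's count in n uniformly random bytes."""
    p = 1 / 256
    return math.sqrt(n_bytes * p * (1 - p))


def observed_sd(data: bytes) -> float:
    """Population standard deviation of the 256 observed counts."""
    return statistics.pstdev(byte_counts(data))


def worst_z_score(data: bytes) -> float:
    """How many expected standard deviations the furthest outlier sits from the mean."""
    if not data:
        raise ValueError("cannot measure an empty file")
    mean = len(data) / 256
    worst = max(abs(count - mean) for count in byte_counts(data))
    return worst / expected_sd(len(data))
```

The advantage of a z-score over a raw percentage is that it already accounts for file size. For a healthy random-looking file, `observed_sd` should land close to `expected_sd`, and the worst z-score should stay in the low single digits; a value far beyond that suggests a byte value that is over- or under-represented.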

Re^6: Calculating corruption
by james28909 (Deacon) on Oct 19, 2014 at 01:01 UTC
    yes, you hit the nail on the head. i do believe checking the std dev for the 0x00 - 0xFF byte values, along with the aforementioned calculating entropy, checking the percentage of how many times each byte shows up in a file, etc., will help to determine (within reason) if the file is corrupt or not. this is not a 100% accurate way of telling, but i think it is a good way to help, and it is exactly what i am after. i am going to read up on standard deviation and try to script up something that will compute it for each of my files. i hope i get the expected results

    also, thank everyone for their time and input :)

    ps also thanks for helping me figure out what my question should have been too. i really need to start taking the extra few mins to think about my question before i post. apologies

      You're welcome, and no need to apologize. But, honestly, it would help a lot if you'd fix the broken Shift keys on your computer. ;-)