
Re^3: Calculating corruption

by Jim (Curate)
on Oct 19, 2014 at 00:08 UTC ( #1104268=note )

in reply to Re^2: Calculating corruption
in thread Calculating corruption

The statistical method you describe to determine the likelihood that a stream of bytes is "corrupted" (i.e., altered in some way from its original state) will only work for a very specific kind of corruption:  the kind that results in the assumed randomness of the bytes (due to encryption) being measurably reduced. If this is exactly the kind of corruption you expect and want to identify when it occurs, and you don't expect or want to identify any other kind of corruption, then the statistical method you describe may be useful to you.

Let's say you have an encrypted file that consists of 1,234,567,890 bytes. One arbitrary bit of one arbitrary byte is switched from 0 to 1, or vice versa. The file is now "corrupted" (i.e., altered from its original state). You will never discover this corruption after the fact by any statistical method (guesswork).
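For concreteness, here is a small Perl sketch of that point (the buffer length and the use of rand as a stand-in for real ciphertext are my own assumptions): flipping a single bit changes two of the 256 byte-frequency counts by one each, which leaves the standard deviation of the counts essentially unchanged.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pseudo-random buffer as a stand-in for real ciphertext (an assumption).
my $len = 200_000;
my $buf = join '', map { chr int rand 256 } 1 .. $len;

# Flip one arbitrary bit of one arbitrary byte.
my $corrupt = $buf;
my $pos = int rand $len;
substr($corrupt, $pos, 1) = chr( ord( substr $corrupt, $pos, 1 ) ^ 0x01 );

# Standard deviation of the 256 byte-frequency counts.
sub freq_stddev {
    my ($data) = @_;
    my @count = (0) x 256;
    $count[$_]++ for unpack 'C*', $data;
    my $mean = length($data) / 256;
    my $var  = 0;
    $var += ( $_ - $mean )**2 for @count;
    return sqrt( $var / 256 );
}

printf "original:  %.4f\ncorrupted: %.4f\n",
    freq_stddev($buf), freq_stddev($corrupt);
```

The two printed values differ by a negligible amount, so a statistic like this cannot flag single-bit corruption.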

Replies are listed 'Best First'.
Re^4: Calculating corruption
by james28909 (Deacon) on Oct 19, 2014 at 00:18 UTC
    "You will never discover this corruption after the fact by any statistical method (guesswork)."

    yes sir, i completely understand that, and realise there is no way to actually tell if an encrypted file is corrupted in any way, but you can measure certain things to help signify (to a certain extent) whether the file is corrupted or partially corrupted. otherwise you would need the means to decrypt the file and checksum it like was said earlier, which will not work because the file cannot be decrypted: the keys are not known and more than likely never will be. so i am just trying to come up with some methods to check it for any possibility of being corrupt.

    the program i used a long time ago computed this std dev from any given file, and for each revision of this file, the std dev was always within a marginal range of the expected outcome. if it was WAY off, then you knew the file was probably corrupted.
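A script along those lines might look like the following Perl sketch (the chunked read, the demo-file fallback, and the filename are my own; this is not the original program):

```perl
#!/usr/bin/perl
# Byte-frequency standard deviation of a file, read in 64 KB chunks.
use strict;
use warnings;

my $file = shift @ARGV;
unless ( defined $file ) {
    # No argument given: write a small pseudo-random demo file
    # (hypothetical name) so the sketch runs standalone.
    $file = 'demo_dump.bin';
    open my $out, '>:raw', $file or die "$file: $!\n";
    print {$out} map { chr int rand 256 } 1 .. 200_000;
    close $out;
}

open my $fh, '<:raw', $file or die "$file: $!\n";
my @count = (0) x 256;
my $total = 0;
while ( read $fh, my $chunk, 65_536 ) {
    $count[$_]++ for unpack 'C*', $chunk;
    $total += length $chunk;
}
close $fh;

my $mean = $total / 256;
my $var  = 0;
$var += ( $_ - $mean )**2 for @count;
my $stddev = sqrt( $var / 256 );

printf "%s: %d bytes, mean count %.2f, std dev %.4f\n",
    $file, $total, $mean, $stddev;
```

Run it against several known-good dumps first to establish what "within a marginal range" means for your files.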

    that, along with calculating entropy + byte-for-byte repetition checking + the percentage of how many times each byte character occurs in said file, will go a long way i think :)
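The entropy measure mentioned above can be computed from the same byte counts. A minimal Perl sketch of Shannon entropy in bits per byte (the two sample buffers are made-up stand-ins for a good dump and a degenerate one):

```perl
#!/usr/bin/perl
# Shannon entropy in bits per byte; 8.0 is the maximum possible,
# and healthy ciphertext should come out very close to it.
use strict;
use warnings;

sub entropy_bits_per_byte {
    my ($data) = @_;
    my @count = (0) x 256;
    $count[$_]++ for unpack 'C*', $data;
    my $total = length $data;
    my $h = 0;
    for my $c ( grep { $_ > 0 } @count ) {
        my $p = $c / $total;
        $h -= $p * log($p) / log(2);
    }
    return $h;
}

my $random = join '', map { chr int rand 256 } 1 .. 100_000;
my $biased = 'A' x 100_000;    # worst case: one repeated byte

printf "random-ish: %.4f bits/byte\n", entropy_bits_per_byte($random);
printf "constant:   %.4f bits/byte\n", entropy_bits_per_byte($biased);
```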
      "that along with calculating entropy + byte for byte repetition checking + the percentage of how many times each byte character is in said file will go along way i think :)"
      You seem to assume that your encrypted file is more or less like a stream of random characters and thus any "deviation" from such "randomness" indicates a corruption.

      This, of course, is a false assumption. There is no need or reason why an encrypted file should be anything like random noise.

      Consider the unbreakable encryption of the "one-time pad": in other words, a key no shorter than the message it encrypts. Unless you have access to the key, the encrypted file can be anything, but it can never be decrypted. There is absolutely no way you can discern a properly encrypted file from a corrupted one, since any string of characters can mean anything. It all depends on the content of the key.
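The one-time-pad point can be demonstrated in a few lines of Perl using stringwise XOR (both messages are hypothetical examples of the same length): the same ciphertext "decrypts" to completely different plaintexts depending on which key you claim was used.

```perl
#!/usr/bin/perl
# With a one-time pad, one ciphertext is consistent with ANY
# plaintext of the same length -- it all depends on the key.
use strict;
use warnings;

my $plaintext  = 'ATTACK AT DAWN';
my $key        = join '', map { chr int rand 256 } 1 .. length $plaintext;
my $ciphertext = $plaintext ^ $key;    # bitwise XOR of two strings

# Decrypting with the right key recovers the message ...
print +( $ciphertext ^ $key ), "\n";    # ATTACK AT DAWN

# ... but a different key "decrypts" the SAME bytes to something else.
my $other    = 'RETREAT AT TEN';        # same length, invented message
my $fake_key = $ciphertext ^ $other;
print +( $ciphertext ^ $fake_key ), "\n";    # RETREAT AT TEN
```

So no statistic computed on the ciphertext alone can tell you which (if either) message it "really" contains.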

      If your encrypted file shows certain characteristics whose absence indicates corruption, then the original encryption was by definition less secure.


      "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      My blog: Imperial Deltronics
        proper crypto would be pretty darn random if implemented right. i am no guru or anything, but i do know some things about the cat and mouse game. the whole point of encrypting a message or a file is to make it look as much like a random stream of bytes as possible. proper ECDSA with a randomized key is something that is very hard to break, and from what i understand it would take multiple supercomputers thousands of years to break the encryption of a single file <.<

        but yeah, to me, when crypto is implemented right, the stream of bits generated is pretty random, because it would not be a good thing to be dependably predictable ;) at least not for this corporation anyway lol.

        and also, like i said, the std dev of these files is always within a certain range, and if it is off by 1 to 1.5%, then that usually means the file is corrupt. once i sit down and code the script to compute some statistics on said files, i will post a zip full of these files and you can try for yourself :)

      So with this clearer statement of your actual problem, we can see there's a statistical method you can use to determine if a collection of bytes is less random than expected. And I may be better able to help you nail down the simplest method than a smart person is precisely because I don't know mathematics or statistics very well.

      In an encrypted file, each of the 256 bytes from 0 through 255 will occur about the same number of times. They won't occur the exact same number of times, of course, but they'll mostly be very close in frequency. (This is one of your stated assumptions.) You can easily measure the maximum variance from the mean of the frequencies of one or more example encrypted files. I remember learning the word "epsilon" a few years ago. I think it applies here. You compute a useful epsilon to use to determine if one or more bytes of an encrypted file occur more or less frequently than expected. Wild outliers imply corruption.

      I used the word "variance" above. I think standard deviation is a measure of statistical variance. (I'm not going to google it now. I'm winging this explanation on intuition and poor memory.) I think of the epsilon I described above as being the result of computing the greatest percentage difference from the mean of the furthest outlier from the mean in a viable encrypted file. I don't know enough about standard deviation to know if it has anything to do with my naïve conception of "percentage difference from the mean." But I suspect it does.
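That "epsilon" idea might be sketched like this in Perl (the calibration buffer, the simulated corruption, and all thresholds are my own assumptions, not measurements from real dumps): find the worst outlier among the 256 byte frequencies as a percentage difference from the mean, calibrate on known-good files, then flag anything far outside that range.

```perl
#!/usr/bin/perl
# Worst byte-frequency outlier, as a percentage difference from the mean.
use strict;
use warnings;

sub max_deviation_pct {
    my ($data) = @_;
    my @count = (0) x 256;
    $count[$_]++ for unpack 'C*', $data;
    my $mean  = length($data) / 256;
    my $worst = 0;
    for my $c (@count) {
        my $dev = abs( $c - $mean ) / $mean * 100;
        $worst = $dev if $dev > $worst;
    }
    return $worst;
}

# Calibrate epsilon on known-good data, then flag wild outliers.
my $good = join '', map { chr int rand 256 } 1 .. 300_000;
my $bad  = $good . ( "\x00" x 30_000 );    # simulated corruption: a zero run

printf "good dump worst outlier: %7.2f%% from mean\n", max_deviation_pct($good);
printf "bad dump  worst outlier: %7.2f%% from mean\n", max_deviation_pct($bad);
```

For pseudo-random data the worst outlier stays in the single-digit percentages, while a run of identical bytes sends it into the hundreds or thousands; the useful epsilon is whatever margin your known-good dumps establish.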

        yes, you hit the nail on the head i do believe: checking the std dev for the 0x00 - 0xFF byte characters. this, along with the aforementioned calculating entropy, checking the percentage of how many times each byte shows up in a file, etc., will help to determine (within reasonable consideration) if the file is corrupt or not. this is not a 100% accurate way of telling, but i think it is a good way to help, and is exactly what i am after. i am going to read up on standard deviation and try to script up something that will compute it for each of my files. i hope i get the expected results

        also, thank everyone for their time and input :)

        ps also thanks for helping me figure out what my question should have been too. i really need to start taking the extra few mins to think about my question before i post. apologies

      If "the keys are unknown and more than likely will never be known" the files cannot be decrypted, so who cares if they are corrupted or not?

      1 Peter 4:10
        while downgrading PS3s, there is some per-console data. this per-console data is what i am talking about. there is no way to get the keys for the few files i am talking about without destroying hardware. when you dump the flash contents, this very sensitive data is dumped along with it, and as said many times in this thread already, if one bit (of the billions of bits throughout the dump) is off at all, it will brick the system.
        that's why it is useful to have many different methods to check this per-console data, because there is no way to decrypt it without destroying hardware to get the keys (afaik). and if you flash back bad data, you have a nice paperweight on your hands that cannot be salvaged EVER.

        and i need to say again, i guess: there is no way to tell 100%, this has been established many times already, but the methods used have an expected outcome that has been tried and true on thousands and thousands of these console dumps. if the data falls outside of a certain range in any statistical analysis, then you can place all bets on the dump being bad.
