in reply to Re: Calculating corruption
in thread Calculating corruption

well i have been reading some more and came across standard deviation and spread. would this be a possibility worth pursuing? because you could compare that against all other encrypted files (even though they use a different key) and the outcome should be similar, right?

std dev /should/ be moderately comparable from one encrypted file to another of the same data, right? or at least within a certain range. if it falls outside that range, then you can safely say it is more than likely corrupted, right?
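
something like this quick perl sketch is what i am thinking of (the file name 'encrypted.bin' is just a placeholder) -- read the raw bytes and compute the std dev of their values:

  #!/usr/bin/perl
  use strict;
  use warnings;

  # slurp the whole file as raw bytes (file name is a placeholder)
  my $file = shift // 'encrypted.bin';
  open my $fh, '<:raw', $file or die "cannot open $file: $!";
  my $data = do { local $/; <$fh> };
  close $fh;

  # count how often each byte value 0..255 occurs
  my @count = (0) x 256;
  $count[$_]++ for unpack 'C*', $data;

  # mean and standard deviation of the byte values
  my $n    = length $data;
  my $mean = 0;
  $mean += $_ * $count[$_] for 0 .. 255;
  $mean /= $n;

  my $var = 0;
  $var += $count[$_] * ( $_ - $mean )**2 for 0 .. 255;
  $var /= $n;

  printf "bytes: %d  mean: %.4f  std dev: %.4f\n", $n, $mean, sqrt $var;

running that over a handful of encrypted copies of the same data would show pretty quickly whether the numbers really do land in a narrow band.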

Re^3: Calculating corruption
by BrowserUk (Patriarch) on Oct 18, 2014 at 23:48 UTC
    well i have been reading some more and came across standard deviation. would this be a possibility worth pursuing? because you could compare that against all other encrypted files (even though they use a different key) and the outcome should be similar, right?

    Why? (Why would they have a similar StdDev?)

    Standard Deviation measures deviation from the mean. Given the full (i.e. exhaustive) set of all possible datasets of some given, necessarily small, size, the standard deviations of those datasets would range across, and be spread throughout, their entire possible range.

    Hence, the StdDev of any single sample -- of anything -- means exactly nothing!

    That is, if the inputs are truly 'random', then the standard deviations are linearly (uniformly) distributed, and thus completely uninformative.


      i am not trying to argue with you at all, so please don't take it personally :)

      but actually, upon further examination, one of the programs i have used in the past has this std dev function. furthermore, there are a few different revisions of this said encrypted file, and each of these revisions has an expected outcome. once you compute the file's std dev and compare it with the known expected value, if it is within a generally close range, then that means the file is not corrupted. i am not saying this is the end-all be-all of how to check a file for corruption, but somehow this other program is able to compute it and the result is within a reasonably expected range... every time... per revision of the file, unless the file is corrupted. maybe i need to script up something real quick and just check to see what the outcome will be :)
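
      here is roughly the kind of quick script i mean -- the per-revision expected std devs and the tolerance are completely made-up placeholders that would have to come from known-good copies of each revision:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # std dev of the raw byte values of a file
        sub std_dev_of_bytes {
            my ($file) = @_;
            open my $fh, '<:raw', $file or die "cannot open $file: $!";
            my @bytes = unpack 'C*', do { local $/; <$fh> };
            my $mean  = 0;
            $mean += $_ for @bytes;
            $mean /= @bytes;
            my $var = 0;
            $var += ( $_ - $mean )**2 for @bytes;
            return sqrt( $var / @bytes );
        }

        # made-up expected std devs per revision and a guessed tolerance
        my %expected  = ( rev1 => 73.9, rev2 => 74.1, rev3 => 73.6 );
        my $tolerance = 0.5;

        my ( $rev, $file ) = @ARGV;
        die "usage: $0 rev1|rev2|rev3 file\n"
            unless $rev and exists $expected{$rev} and defined $file;

        my $sd = std_dev_of_bytes($file);
        printf "%s: std dev %.4f, expected %.1f -> %s\n",
            $file, $sd, $expected{$rev},
            abs( $sd - $expected{$rev} ) <= $tolerance ? 'probably ok' : 'possibly corrupted';
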
        furthermore, there are a few different revisions of this said encrypted file, and each of these revisions has an expected outcome. once you compute the file's std dev and compare it with the known expected value, if it is within a generally close range, then that means the file is not corrupted.

        That makes no sense at all.

        Let's say the corruption that occurred was that every pair of bytes in the file was transposed -- eg. abcdefgh corrupted to badcfehg; a type of corruption that frequently occurs when files are written on big-endian machines and read on little-endian ones, or vice versa.

        Pretty much every type of statistical analysis applied to the bytes of the file will give exactly the same result, because the per-byte-value counts are unchanged.
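
        A two-minute sketch to convince yourself of that (the 1k of random bytes just stands in for the encrypted data):

          #!/usr/bin/perl
          use strict;
          use warnings;

          # some stand-in "encrypted" data, and the byte-pair-transposed version
          my $orig = join '', map { chr int rand 256 } 1 .. 1024;
          ( my $swapped = $orig ) =~ s/(.)(.)/$2$1/gs;

          # the per-byte-value counts are identical, so mean, std dev, entropy,
          # frequency percentages -- all of it -- come out identical too
          for my $data ( $orig, $swapped ) {
              my @count = (0) x 256;
              $count[$_]++ for unpack 'C*', $data;
              my ( $mean, $var ) = ( 0, 0 );
              $mean += $_ * $count[$_] for 0 .. 255;
              $mean /= length $data;
              $var += $count[$_] * ( $_ - $mean )**2 for 0 .. 255;
              printf "mean %.6f  std dev %.6f\n", $mean, sqrt( $var / length $data );
          }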

        Equally, if you change any 1 bit in any one byte of a (say) 1MB file, you'd need to calculate your standard deviation to an accuracy beyond the limits of double precision in order to detect the change.

        You're flogging a dead horse.


Re^3: Calculating corruption
by Jim (Curate) on Oct 19, 2014 at 00:08 UTC

    The statistical method you describe to determine the likelihood that a stream of bytes is "corrupted" (i.e., altered in some way from its original state) will only work for a very specific kind of corruption:  the kind that results in the assumed randomness of the bytes (due to encryption) being measurably reduced. If this is exactly the kind of corruption you expect and want to identify when it occurs, and you don't expect or want to identify any other kind of corruption, then the statistical method you describe may be useful to you.

    Let's say you have an encrypted file that consists of 1,234,567,890 bytes. One arbitrary bit of one arbitrary byte is switched from 0 to 1, or vice versa. The file is now "corrupted" (i.e., altered from its original state). You will never discover this corruption after the fact by any statistical method (guesswork).
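
    To put a number on it: flipping one bit changes a single byte value by at most 128, so the mean of those 1,234,567,890 bytes shifts by at most 128/1,234,567,890, or roughly 0.0000001, which is far smaller than the ordinary file-to-file variation of any statistic you could compare it against.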

      "You will never discover this corruption after the fact by any statistical method (guesswork)."

      yes sir, i completely understand that, and realise there is no way to actually tell if an encrypted file is corrupted in any way, but you can measure certain things to help signify (to a certain extent) whether the file is corrupted or partially corrupted. otherwise you would need the means to decrypt the file and checksum it as said earlier, which will not work because the file cannot be decrypted because the keys are not known and more than likely never will be. so i am just trying to come up with some methods to check it for any possibility of being corrupt.

      the program i used a long time ago computed this std dev for any given file, and for each revision of this file the std dev was always within a small margin of the expected outcome. if it was WAY off, then you knew the file was probably corrupted.

      that, along with calculating entropy + byte-for-byte repetition checking + the percentage of times each byte value occurs in said file, will go a long way i think :)
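
      just as a sketch of what i mean (the file name is a placeholder again) -- shannon entropy in bits per byte plus the percentage share of each byte value:

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $file = shift // 'encrypted.bin';    # placeholder name
        open my $fh, '<:raw', $file or die "cannot open $file: $!";
        my $data = do { local $/; <$fh> };
        close $fh;

        my $n     = length $data;
        my @count = (0) x 256;
        $count[$_]++ for unpack 'C*', $data;

        # shannon entropy in bits per byte; 8.0 would be "perfectly random"
        my $entropy = 0;
        for my $c ( grep { $_ > 0 } @count ) {
            my $p = $c / $n;
            $entropy -= $p * log($p) / log(2);
        }
        printf "entropy: %.6f bits/byte (max 8)\n", $entropy;

        # percentage of the file taken up by each byte value
        printf "byte %3d: %.4f%%\n", $_, 100 * $count[$_] / $n for 0 .. 255;
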
        that, along with calculating entropy + byte-for-byte repetition checking + the percentage of times each byte value occurs in said file, will go a long way i think :)
        You seem to assume that your encrypted file is more or less like a stream of random characters and thus any "deviation" from such "randomness" indicates a corruption.

        This of course is a false assumption. There is no need nor reason why an encrypted file should be anything like random noise.

        Consider the unbreakable encryption of the "one-time pad": in other words, a key at least as long as the message it encrypts. Unless you have access to the key, your encrypted file can be anything, but it can never be decrypted. There is absolutely no way you can discern a properly encrypted file from a corrupted one, since any string of characters can mean anything; it all depends on the content of the key.

        If your encrypted file shows certain characteristics whose absence would indicate corruption, then the original encryption was by definition less secure.
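
        A small illustration (the strings are made up): for any ciphertext and any plaintext of the same length there is a key that maps one to the other, so the ciphertext alone constrains nothing.

          #!/usr/bin/perl
          use strict;
          use warnings;

          # any ciphertext whatsoever ...
          my $ciphertext = join '', map { chr int rand 256 } 1 .. 12;

          # ... "decrypts" to whichever plaintext you like, given the right key
          for my $plaintext ( 'attack at 9!', 'stand down!!' ) {
              my $key       = $ciphertext ^ $plaintext;    # key = ciphertext XOR plaintext
              my $recovered = $ciphertext ^ $key;          # XOR again with that key
              print "$recovered\n";                        # prints that plaintext exactly
          }

        Both iterations print a perfectly sensible "decryption" of the same random ciphertext, which is exactly why the bytes by themselves cannot tell you whether they have been damaged.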

        CountZero


        So with this clearer statement of your actual problem, we can see there's a statistical method you can use to determine if a collection of bytes is less random than expected. And I may be better able to help you nail down the simplest method than a smart person is precisely because I don't know mathematics or statistics very well.

        In an encrypted file, each of the 256 byte values from 0 through 255 will occur about the same number of times. They won't occur the exact same number of times, of course, but they'll mostly be very close in frequency. (This is one of your stated assumptions.) You can easily measure the maximum variance from the mean of the frequencies in one or more example encrypted files. I remember learning the word "epsilon" a few years ago. I think it applies here. You compute a useful epsilon to use to determine whether one or more byte values of an encrypted file occur more or less frequently than expected. Wild outliers imply corruption.

        I used the word "variance" above. I think standard deviation is a measure of statistical variance. (I'm not going to google it now. I'm winging this explanation on intuition and poor memory.) I think of the epsilon I described above as being the result of computing the greatest percentage difference from the mean of the furthest outlier from the mean in a viable encrypted file. I don't know enough about standard deviation to know if it has anything to do with my naïve conception of "percentage difference from the mean." But I suspect it does.
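
        Here's a sketch in Perl of the kind of check I mean. The epsilon value is a placeholder; you'd calibrate it from one or more known-good encrypted files.

          #!/usr/bin/perl
          use strict;
          use warnings;

          my $file = shift or die "usage: $0 encrypted-file\n";
          open my $fh, '<:raw', $file or die "Can't open $file: $!";
          my $data = do { local $/; <$fh> };
          close $fh;

          my $n     = length $data;
          my @count = (0) x 256;
          $count[$_]++ for unpack 'C*', $data;

          # expected frequency of each byte value if the bytes were uniformly random
          my $expected = $n / 256;

          # largest percentage difference from that expectation across all 256 values
          my $worst = 0;
          for my $c (@count) {
              my $diff = abs( $c - $expected ) / $expected * 100;
              $worst = $diff if $diff > $worst;
          }

          # epsilon is a placeholder; calibrate it against known-good encrypted files
          my $epsilon = 5;    # percent
          printf "worst outlier: %.2f%% from expected -> %s\n",
              $worst, $worst <= $epsilon ? 'plausible ciphertext' : 'suspicious';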

        If "the keys are unknown and more than likely will never be known" the files cannot be decrypted, so who cares if they are corrupted or not?
