What data is code?

Beatnik has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re (tilly) 1: What data is code? by tilly (Archbishop) on Nov 22, 2001 at 03:56 UTC
Attempt to compress the text with a widely used compression library, like Compress::Zlib. If it compresses significantly, it is unencrypted. You will need to experiment a bit to find your threshold, but that test should work pretty well.
Re: Re (tilly) 1: What data is code? by jepri (Parson) on Nov 22, 2001 at 06:28 UTC
Even better, LZ77-style algorithms such as the one zlib uses are known to do poorly on data with few repeating patterns. Zlib can produce a 'compressed' file larger than the original file. This should occur in the cases that you describe in your post. As you say, programs tend to compress well, due to whitespace and repeated variable names. So you only have to check if the return from zlib is larger than the original file. ____________________ Jeremy I didn't believe in evil until I dated it.
Re: Re (tilly) 1: What data is code? by Beatnik (Parson) on Nov 22, 2001 at 04:01 UTC
I'm keeping the Compress stuff for another module set :))) Anyway, uhm the cipher and keyphrase are totally user dependent. The key feature is the encryption, not really the compression (altho filter stacking should be doable). Paul Marquess provides a simple compression filter in Filter tho. The compression would have the same problem, how do you know if it's compressed or not?? Greetz Beatnik ... Quidquid perl dictum sit, altum viditur.
Re (tilly) 3: What data is code? by tilly (Archbishop) on Nov 22, 2001 at 04:13 UTC
Um, there is no problem. If you attempt to compress already compressed and/or encrypted data, you should not get significant further compression. If you do, then your original encryption or compression was of pretty poor quality. But if you take realistic text (e.g. program code) and use popular compression algorithms, you will reliably get significant compression. (Which is why people use them in the first place.) Therefore by taking some text and attempting to compress it, you can tell normal data from compressed or encrypted data. For instance, you might say that if you can reduce its length by 20% or more, it is normal text. Aside from a few short text sequences, you are unlikely to go wrong with a test like this. There is, however, no way upon casual inspection to distinguish compressed data from encrypted data from white noise. The reasons for this involve information theory.
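The compression test described above could be sketched roughly as follows. This is only an illustration: the function names `compression_saving` and `looks_like_plain_text` are made up here, and the 0.20 default simply mirrors the "20% or more" figure from the post, so expect to tune it on your own files.

```perl
use strict;
use warnings;
use Compress::Zlib;    # exports compress()

# Fraction by which zlib shrinks the input (negative if the output grows).
sub compression_saving {
    my ($data) = @_;
    return 0 unless length $data;
    my $packed = compress($data);
    die "compress failed" unless defined $packed;
    return 1 - length($packed) / length($data);
}

# Assumed threshold: a saving of 20% or more counts as "normal text".
sub looks_like_plain_text {
    my ($data, $threshold) = @_;
    $threshold = 0.20 unless defined $threshold;
    return compression_saving($data) >= $threshold ? 1 : 0;
}
```

Very short inputs are the weak spot, as the post notes, because the fixed zlib overhead can swallow whatever saving there is.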
Re: Re (tilly) 3: What data is code? by Beatnik (Parson) on Nov 22, 2001 at 04:22 UTC
Re (tilly) 5: What data is code? by tilly (Archbishop) on Nov 22, 2001 at 05:11 UTC
Re: What data is code? by blakem (Monsignor) on Nov 22, 2001 at 03:45 UTC
You could do a frequency analysis of the characters. I'm sure perl programs have a much different character distribution than encrypted data. Encrypted data should probably have a very flat frequency graph (i.e. all characters occur in equal proportions) but perl code would have a much different set of ratios. -Blake
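One way to put a number on "how flat is the frequency graph" is the Shannon entropy of the byte histogram. That measure is an assumption added here, not something named in the post, and `byte_entropy` is a made-up helper name. A perfectly flat distribution over all 256 byte values scores 8 bits per byte, and source code should land well below that.

```perl
use strict;
use warnings;

# Shannon entropy of the character distribution, in bits per character.
# For byte strings the result lies between 0 (one repeated byte)
# and 8 (all 256 byte values equally common).
sub byte_entropy {
    my ($data) = @_;
    my $n = length $data;
    return 0 unless $n;

    my %count;
    $count{$_}++ for split //, $data;

    my $entropy = 0;
    for my $c (values %count) {
        my $p = $c / $n;
        $entropy -= $p * log($p) / log(2);
    }
    return $entropy;
}
```

Where to draw the line between "code" and "encrypted" is left open; it would need checking against real files.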
Re: Re: What data is code? by petral (Curate) on Nov 22, 2001 at 05:09 UTC
Not that anybody asked, but here's the order of frequency for chars in the .pm files (lib/5.../.pm and /.pm) in 5.00503 and 5.6.1:

etsia nrold$hucmfp)(b ;=y_g,'.:>"-E
w{}CTIvS#kANOxRL<PM/D@F\1UqB0HG[]2V&\|+zXWY%j~!?^3K549786ZJQ`ö

etisanr oldhucmf$pb()yg=_;-,#. >':"
wCE}{TvISkAxN<OR\PL/MF0D@1qUBH2[]G*\|&VzX+%Wj?3~^Y5!4K6987ZJQ`

I inserted the linefeeds before the w's so's to fit. The tabs are between b and ; and between . and > , in case they don't travel. The only mildly surprising thing is that the uppercase is used more than the numerals. (and j is used very little, and there's more )'s than ('s in 5.005, hmmm) Of course, the perl fingerprint is the buck($) being in the first 20 characters while all the digits are past the 60th. p
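For anyone who wants to build the same kind of ordering for their own files, here is a small sketch. It is not the code that produced the lists above; `chars_by_frequency` and `rank_of` are names invented for this example.

```perl
use strict;
use warnings;

# Return the distinct characters of $text, most frequent first.
# Ties are broken by character code so the result is deterministic.
sub chars_by_frequency {
    my ($text) = @_;
    my %count;
    $count{$_}++ for split //, $text;
    return join '',
        sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;
}

# 1-based position of $char in that ordering, or 0 if it never occurs.
sub rank_of {
    my ($text, $char) = @_;
    return index(chars_by_frequency($text), $char) + 1;
}
```

With these, the "fingerprint" check would be something like testing that `rank_of($text, '$')` is between 1 and 20, with the cutoff being a guess to verify against real code.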
Re: What data is code? by Zaxo (Archbishop) on Nov 22, 2001 at 04:03 UTC
One possibility is to check character frequency. Your semicolon idea sounds good: if the number of semicolons is comparable to the number of newlines (and "\n" isn't encoded), chances are it's code. If newlines get encoded too, lines will be unusually long. The primary single measure of language-like structure is called Friedman's kappa, which is an index of coincidence. Scan the text, comparing each character with the character a fixed distance ahead, and increment a count whenever the two are equal. Score that as a fraction of the pairs compared. Random or well-encoded text will score about 1/(alphabet size). The redundancy of useful language leads to an index in the range .05-.08 for natural language. After Compline, Zaxo
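The kappa scan described above is short enough to sketch directly. The function name `kappa` and the default shift of 1 are choices made for this example only.

```perl
use strict;
use warnings;

# Index of coincidence: the fraction of positions where a character
# equals the character $shift places further on.
sub kappa {
    my ($text, $shift) = @_;
    $shift = 1 unless defined $shift;

    my $pairs = length($text) - $shift;
    return 0 if $pairs <= 0;

    my $hits = 0;
    for my $i (0 .. $pairs - 1) {
        $hits++ if substr($text, $i, 1) eq substr($text, $i + $shift, 1);
    }
    return $hits / $pairs;
}
```

For uniformly random bytes the alphabet size is 256, so the score should hover around 1/256, roughly 0.004. Averaging over several shifts is a natural refinement, but this sketch does not do that.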
Re: What data is code? by chipmunk (Parson) on Nov 22, 2001 at 03:54 UTC
I'd go with solution #3, using an unencrypted comment that indicates the code has been encrypted. Yes, the user might remove or change it, but the user could just as easily fiddle with the encrypted form of the code itself. So, I don't think it's worth worrying especially about the integrity of the comment.
Re: What data is code? by Anonymous Monk on Nov 22, 2001 at 03:52 UTC
Before filtering, test for a known encrypted value in the code (a known comment that you insert when you do filter/encrypt). If the encrypted comment is there, don't encrypt; if it isn't there, prepend the special comment, then encrypt.
Re: Re: What data is code? by Beatnik (Parson) on Nov 22, 2001 at 03:55 UTC
I can't manipulate the encrypting process itself since I use Crypt::CBC. The user (of the module) will provide keyphrase and cipher. I could check the very first string (til it matches a ;) after the use statement with `eval` tho :) Greetz Beatnik ... Quidquid perl dictum sit, altum viditur.
Re: Re: Re: What data is code? by Anonymous Monk on Nov 22, 2001 at 04:39 UTC
It shouldn't matter that the user has control over the code encryption technique; you only need to control the magic. When your filter is called, it can encrypt a special string (magic cookie) by any means it wishes. Now it can test the incoming code to see if that encrypted string is at the beginning: `if(index($_,$cookie) == 0)`. If it is, then the file is encrypted, so strip it off and then decrypt the rest using the user's specified parameters. If it isn't there, encrypt the code via the user's parameters, prepend your encrypted magic cookie, overwrite the source file and exit. You have control over the cookie: you can create it, test for it, add it or strip it as required.
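A minimal sketch of that cookie logic, with several assumptions: `$COOKIE` is a fixed made-up string rather than something the filter encrypts itself, `toggle` is an invented name, and the `$encrypt` / `$decrypt` code refs stand in for whatever the user's cipher and keyphrase provide. Rewriting the source file is left out.

```perl
use strict;
use warnings;

# Hypothetical marker. In the scheme described above the filter would
# produce this by encrypting a string of its own choosing.
my $COOKIE = "#-*-encrypted-*-#\n";

# $encrypt and $decrypt are code refs standing in for the user's
# cipher and keyphrase. Returns the action taken and the new text.
sub toggle {
    my ($source, $encrypt, $decrypt) = @_;

    if (index($source, $COOKIE) == 0) {
        # Cookie at the very start: strip it and decrypt the rest.
        my $body = substr($source, length $COOKIE);
        return ('decrypted', $decrypt->($body));
    }

    # No cookie: encrypt the code and prepend the cookie.
    return ('encrypted', $COOKIE . $encrypt->($source));
}
```

Because the cookie is checked only at offset 0, source code that merely mentions the cookie string somewhere further down is still treated as unencrypted.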