Beatnik has asked for the wisdom of the Perl Monks concerning the following question:
Hey,
I'm fixing some bugs in Filter::CBC and I encountered a problem...
The code is encrypted the first time it is run and should be left alone on every run after that. So how do I detect whether the code has already been encrypted? (Similar behaviour to Acme::Bleach.)
- I tried checking for ; but the encrypted data can contain ;s as well (although probably not as many, ignoring ObFus).
- Another try was to check for keywords. print, my and if came to mind, but a program without those keywords is still possible (although unlikely).
- A thought that crossed my mind was to put a comment on an unencrypted line noting that the code is encrypted, but the user might remove it, change it, etc.
- My last attempt used eval. By eval'ing the current line you could see whether it is more or less valid Perl. The problems: code spread over multiple lines, empty lines, POD, variables declared earlier, blocks, etc. Also, the point of the module is not to verify that the code is valid but to make sure no double encryption happens.
So I'm pretty much stuck. The 'checking-for-keywords' method will probably return the fewest false positives. Any suggestions would be appreciated a lot :)
Greetz
Beatnik
... Quidquid perl dictum sit, altum viditur.
Re (tilly) 1: What data is code?
by tilly (Archbishop) on Nov 22, 2001 at 03:56 UTC
Attempt to compress the text with a widely used compression,
like Compress::Zlib.
If it compresses significantly, it is unencrypted.
You will need to do some playing around to figure out your
threshold. But that test should work pretty well.
I'm keeping the Compress stuff for another module set :)))
Anyway, uhm the cipher and keyphrase are totally user dependent. The key feature is the encryption, not really the compression (altho filter stacking should be doable). Paul Marquess provides a simple compression filter in Filter tho. The compression would have the same problem, how do you know if it's compressed or not??
Greetz
Beatnik
... Quidquid perl dictum sit, altum viditur.
Um, there is no problem.
If you attempt to compress already compressed and/or
encrypted data, you should not get significant further
compression. If you do then your original encryption or
compression was of pretty poor quality. But if you take
realistic text (e.g. program code) and use popular
compression algorithms, you will
reliably get significant compression. (Which is why
people use them in the first place.)
Therefore by taking some text and attempting to compress
it, you can tell normal data from compressed or encrypted
data. For instance you might say that if you can reduce
its length by 20% or more, it is normal text. Aside from
a few short text sequences, you are unlikely to go
wrong with a test like this.
There is, however, no way upon casual inspection to
distinguish compressed data from encrypted data from
white noise. The reasons for this involve
information theory.
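For what it's worth, here is a minimal sketch of the test described above, assuming Compress::Zlib's compress() and taking the 20% figure as a starting threshold to tune. The name looks_like_plain_text is made up for illustration, and the random bytes only stand in for real ciphertext.

```perl
use strict;
use warnings;
use Compress::Zlib;    # provides compress()

# Hypothetical helper: true if $text shrinks by at least $threshold
# (a fraction of its original length) when deflated.
sub looks_like_plain_text {
    my ($text, $threshold) = @_;
    $threshold = 0.20 unless defined $threshold;
    return 0 unless length $text;

    my $packed = compress($text);
    return 0 unless defined $packed;

    my $saved = 1 - length($packed) / length($text);
    return $saved >= $threshold ? 1 : 0;
}

# Sample "source code": a small snippet repeated so there is enough to compress.
my $code = q{my $total = 0;
foreach my $n (@list) { $total += $n; }
print "total: $total\n";
} x 20;

# Random bytes as a stand-in for encrypted data.
my $noise = pack 'C*', map { int rand 256 } 1 .. 2000;

printf "code:  %s\n", looks_like_plain_text($code)  ? 'plain text' : 'encrypted or compressed';
printf "noise: %s\n", looks_like_plain_text($noise) ? 'plain text' : 'encrypted or compressed';
```

Very short inputs are where this is most likely to go wrong, since the compression header overhead can swamp any savings.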
Re: What data is code?
by blakem (Monsignor) on Nov 22, 2001 at 03:45 UTC
You could do a frequency analysis of the characters. I'm sure perl programs have a much different character distribution than encrypted data. Encrypted data should probably have a very flat frequency graph (i.e. all characters occur in equal proportions) but perl code would have a much different set of ratios.
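One rough way to put a number on "how flat is the frequency graph" is the Shannon entropy of the byte counts. This is only a sketch: byte_entropy is a made-up name, and any cutoff between "code" and "ciphertext" would have to be found by experiment. A perfectly flat distribution over all 256 byte values gives 8 bits per byte, and anything with structure scores lower.

```perl
use strict;
use warnings;

# Shannon entropy of the byte distribution, in bits per byte (0 to 8).
sub byte_entropy {
    my ($data) = @_;
    my $len = length $data or return 0;

    # Count how often each byte value occurs.
    my %count;
    $count{$_}++ for unpack 'C*', $data;

    # Sum -p * log2(p) over the observed byte values.
    my $entropy = 0;
    for my $n (values %count) {
        my $p = $n / $len;
        $entropy -= $p * log($p) / log(2);
    }
    return $entropy;
}

my $code  = q{foreach my $line (@lines) { print $line if $line =~ /\S/; }} . "\n";
my $noise = pack 'C*', map { int rand 256 } 1 .. 4096;   # stand-in for ciphertext

printf "code:  %.2f bits/byte\n", byte_entropy($code x 10);
printf "noise: %.2f bits/byte\n", byte_entropy($noise);
```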
-Blake
Not that anybody asked, but here's the order of frequency for chars in the .pm files (lib/5.../*.pm and */*.pm) in 5.00503 and 5.6.1:
etsia
nrold$hucmfp)(b ;=y_g,'.:>"-E
w{}CTIvS#kANOxRL<PM/D@F\1UqB0HG[]2V*&|+zXWY%j~!?^3K549786ZJQ`ö
etisanr
oldhucmf$pb()yg=_;-,#. >':"
wCE}{TvISkAxN<OR\PL/MF0D@1qUBH2[]G*|&VzX+%Wj?3~^Y5!4K6987ZJQ`
I inserted the linefeeds before the w's so's to fit.  The tabs are between b and ; and between . and > , in case they don't travel.
The only mildly surprising thing is that the uppercase is used more than the numerals. (and j is used very little, and there's more )'s than ('s in 5.005, hmmm)
Of course, the perl fingerprint is the buck($) being in the first 20 characters while all the digits are past the 60th.
  p
Re: What data is code?
by Zaxo (Archbishop) on Nov 22, 2001 at 04:03 UTC
One possibility is to check character frequency. Your semicolon idea sounds good: if the number of semicolons is comparable to the number of newlines (and "\n" isn't encoded), chances are it's code. If newlines get encoded too, lines will be unusually long.
The primary single measure of language-like structure is called the Friedman kappa, an index of coincidence. Scan the text, comparing each character with the character a fixed distance ahead, and increment a count whenever the two are equal. Score that as a fraction of the pairs compared. Random or well-encoded text will score about 1/(alphabet size). The redundancy of useful language leads to an index in the range .05-.08 for natural language.
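A minimal sketch of that scan, working on raw bytes so the "alphabet" has 256 symbols and random data should land near 1/256. The name kappa and the default shift of 1 are my own choices for illustration; where to draw the line for Perl source would need testing on real files.

```perl
use strict;
use warnings;

# Index of coincidence: fraction of positions where the character equals
# the character $shift places further on.
sub kappa {
    my ($text, $shift) = @_;
    $shift = 1 unless defined $shift;

    my $pairs = length($text) - $shift;
    return 0 if $pairs <= 0;

    my $hits = 0;
    for my $i (0 .. $pairs - 1) {
        $hits++ if substr($text, $i, 1) eq substr($text, $i + $shift, 1);
    }
    return $hits / $pairs;
}

my $code = q{
    foreach my $item (@items) {
        next unless defined $item;
        print "item: $item\n";
    }
} x 10;
my $noise = pack 'C*', map { int rand 256 } 1 .. 5000;   # stand-in for ciphertext

printf "code:  %.4f\n", kappa($code);
printf "noise: %.4f\n", kappa($noise);
```

Trying a few different shifts and averaging may give a steadier score than a single distance.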
After Compline, Zaxo
Re: What data is code?
by chipmunk (Parson) on Nov 22, 2001 at 03:54 UTC
I'd go with solution #3, using an unencrypted comment that indicates the code has been encrypted. Yes, the user might remove or change it, but the user could just as easily fiddle with the encrypted form of the code itself. So, I don't think it's worth worrying especially about the integrity of the comment.
Re: What data is code?
by Anonymous Monk on Nov 22, 2001 at 03:52 UTC
Before filtering, test for a known encrypted value in the code (a
known comment that *you* insert when you do filter/encrypt). If the
encrypted comment is there, don't encrypt, if it isn't there, prepend
the special comment then encrypt.
I can't manipulate the encrypting process itself since I use Crypt::CBC. The user (of the module) will provide the keyphrase and cipher. I could check the very first string (up to the first ;) after the use statement with eval tho :)
Greetz
Beatnik
... Quidquid perl dictum sit, altum viditur.
It shouldn't matter that the user has control over the code
encryption technique; you only need to control the magic. When your
filter is called it can encrypt a special string (magic cookie) by
any means it wishes. Now, it can test the incoming code to see if
that encrypted string is at the beginning: if(index($_,$cookie) == 0). If it is then the file is encrypted, so strip it off
and then decrypt the rest using the user's specified parameters. If
it isn't there, encrypt the code via the user's parameters, prepend
your encrypted magic cookie, overwrite the source file and exit. You
have control over the cookie, you can create it, test for it, add it
or strip it as required.
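Roughly, the control flow could look like the sketch below. Everything named here is hypothetical: process() is a made-up helper, the cookie is just an MD5 hex digest of a fixed phrase standing in for "a magic string encrypted by any means", and the XOR routine is a toy placeholder for whatever cipher and keyphrase the user actually supplies. Writing the result back to the source file is left out.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical magic cookie: fixed bytes the filter controls entirely.
my $cookie = md5_hex('Filter::CBC magic cookie');

# Decide what to do with $source. $encrypt and $decrypt are code refs
# wrapping the user's chosen cipher (placeholders in this sketch).
sub process {
    my ($source, $encrypt, $decrypt) = @_;

    if (index($source, $cookie) == 0) {
        # Cookie found at the start: already encrypted, so strip it and decrypt.
        return ('run', $decrypt->(substr($source, length $cookie)));
    }

    # No cookie: encrypt and prepend it, ready to overwrite the source file.
    return ('store', $cookie . $encrypt->($source));
}

# Toy XOR "cipher", only so the sketch runs without extra modules.
my $toy = sub { my ($s) = @_; return $s ^ (chr(0x5A) x length($s)) };

my ($first,  $stored) = process(qq{print "hello\\n";\n}, $toy, $toy);
my ($second, $plain)  = process($stored, $toy, $toy);
print "first pass:  $first\n";
print "second pass: $second\n";
print $plain;
```

Feeding the stored form back in takes the decrypt branch, so the code is never encrypted twice.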