PerlMonks  

What data is code?

by Beatnik (Parson)
on Nov 22, 2001 at 03:32 UTC ( [id://126891]=perlquestion )

Beatnik has asked for the wisdom of the Perl Monks concerning the following question:

Hey,

I'm fixing some bugs in Filter::CBC and I encountered a problem...
The code is encrypted when it is run for the first time, and it should be left alone the second time. Now, how do I detect on the second run that the code is already encrypted? (Similar behaviour to Acme::Bleach.)
  1. I tried checking for ; but the encrypted data can contain ;s as well (altho probably not as many - ignoring ObFus).
  2. Another try was to check for keywords. print, my and if came to mind, but a program without those keywords is still possible (altho unlikely).
  3. A thought that crossed my mind was to have some comment on an unencrypted line noting that the code was encrypted, but the user might remove it, change it, etc...
  4. My last attempt included eval. By eval'ing the current line you could see if it was more or less valid perl. The problems would be: code spread over multiple lines, empty lines, POD, variables declared earlier, blocks, etc. Also, the point of the check is not to verify that the code is valid but to make sure no double encryption happens.
So I'm pretty much stuck. The 'checking-for-keywords' method will probably return the fewest false positives.
Any suggestions would be appreciated a lot :)

Greetz
Beatnik
... Quidquid perl dictum sit, altum viditur.

Replies are listed 'Best First'.
Re (tilly) 1: What data is code?
by tilly (Archbishop) on Nov 22, 2001 at 03:56 UTC
    Attempt to compress the text with a widely used compression, like Compress::Zlib.

    If it compresses significantly, it is unencrypted.

    You will need to do some playing around to figure out your threshold. But that test should work pretty well.

      Even better, LZ-style compressors do badly on data with few repeating patterns (zlib uses DEFLATE, i.e. LZ77 plus Huffman coding, rather than LZW). Zlib can produce a 'compressed' file slightly larger than the original file. This should occur in the cases that you describe in your post. As you say, programs tend to compress well, due to whitespace and repeated variable names. So you only have to check whether the output from zlib is larger than the original file.

      ____________________
      Jeremy
      I didn't believe in evil until I dated it.

      I'm keeping the Compress stuff for another module set :)))

      Anyway, uhm the cipher and keyphrase are totally user dependent. The key feature is the encryption, not really the compression (altho filter stacking should be doable). Paul Marquess provides a simple compression filter in Filter tho.
      The compression would have the same problem, how do you know if it's compressed or not??

      Greetz
      Beatnik
      ... Quidquid perl dictum sit, altum viditur.
        Um, there is no problem.

        If you attempt to compress already compressed and/or encrypted data, you should not get significant further compression. If you do, then your original encryption or compression was of pretty poor quality. But if you take realistic text (e.g. program code) and use popular compression algorithms, you will reliably get significant compression. (Which is why people use them in the first place.)

        Therefore by taking some text and attempting to compress it, you can tell normal data from compressed or encrypted data. For instance you might say that if you can reduce its length by 20% or more, it is normal text. Aside from a few short text sequences, you are unlikely to go wrong with a test like this.

        There is, however, no way upon casual inspection to distinguish compressed data from encrypted data from white noise. The reasons for this involve information theory.
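
The compression test described in this subthread can be sketched roughly as follows. This is only a minimal illustration, not Filter::CBC code: the function name `looks_like_plain_source` is made up for the example, the 20% threshold is the figure suggested above and would need tuning, and the input is assumed to be a byte string.

```perl
use strict;
use warnings;
use Compress::Zlib;    # provides compress()

# Hypothetical helper: true if $text shrinks by at least $threshold
# (a fraction of its length) when run through zlib.
sub looks_like_plain_source {
    my ($text, $threshold) = @_;
    $threshold = 0.20 unless defined $threshold;    # assumed default
    return 0 unless length $text;

    my $packed = compress($text);
    return 0 unless defined $packed;

    my $saved = 1 - length($packed) / length($text);
    return $saved >= $threshold ? 1 : 0;
}

# Repetitive source text versus pseudo-random bytes standing in for ciphertext.
my $sample_code  = 'my $count = 0; print "count is $count\n"; ' x 50;
my $sample_noise = join '', map { chr int rand 256 } 1 .. 4096;

print "code:  ", looks_like_plain_source($sample_code),  "\n";
print "noise: ", looks_like_plain_source($sample_noise), "\n";
```

As noted above, very short inputs are the weak spot: the fixed overhead of the zlib format can outweigh any savings on a few dozen bytes.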

Re: What data is code?
by blakem (Monsignor) on Nov 22, 2001 at 03:45 UTC
    You could do a frequency analysis of the characters. I'm sure perl programs have a much different character distribution than encrypted data. Encrypted data should probably have a very flat frequency graph (i.e. all characters occur in equal proportions) but perl code would have a much different set of ratios.

    -Blake
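
A rough sketch of the frequency idea, assuming byte strings as input. The helper names (`byte_counts`, `entropy_bits`, `frequency_order`) are invented for the example, and Shannon entropy is just one possible way to put a number on how "flat" the distribution is; where to draw the line between code and ciphertext is left open.

```perl
use strict;
use warnings;

# Count how often each character occurs in $text.
sub byte_counts {
    my ($text) = @_;
    my %count;
    $count{$_}++ for split //, $text;
    return \%count;
}

# Shannon entropy in bits per character: close to 8 for uniformly
# random bytes, lower when a few characters dominate.
sub entropy_bits {
    my ($text) = @_;
    my $n = length $text;
    return 0 unless $n;
    my $counts = byte_counts($text);
    my $h = 0;
    for my $c (values %$counts) {
        my $p = $c / $n;
        $h -= $p * log($p) / log(2);
    }
    return $h;
}

# Characters ordered from most to least frequent (ties broken by character).
sub frequency_order {
    my ($text) = @_;
    my $counts = byte_counts($text);
    return join '',
        sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b } keys %$counts;
}

my $snippet = 'my $x = shift; print "$x\n" if defined $x;';
printf "entropy: %.2f bits per character\n", entropy_bits($snippet);
print  "order:   ", frequency_order($snippet), "\n";
```

If the encrypted output is hex- or base64-encoded rather than raw bytes, the flat distribution will be spread over a smaller alphabet, so the entropy ceiling is correspondingly lower.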

      Not that anybody asked, but here's the order of frequency for chars in the .pm files (lib/5.../*.pm and */*.pm) in 5.00503 and 5.6.1:
      5.00503: etsia nrold$hucmfp)(b ;=y_g,'.:>"-E w{}CTIvS#kANOxRL<PM/D@F\1UqB0HG[]2V*&|+zXWY%j~!?^3K549786ZJQ`ö
      5.6.1:   etisanr oldhucmf$pb()yg=_;-,#. >':" wCE}{TvISkAxN<OR\PL/MF0D@1qUBH2[]G*|&VzX+%Wj?3~^Y5!4K6987ZJQ`
      I inserted the linefeeds before the w's so's to fit.  The tabs are between b and ; and between . and > , in case they don't travel.
      The only mildly surprising thing is that the uppercase is used more than the numerals. (and j is used very little, and there's more )'s than ('s in 5.005, hmmm)

      Of course, the perl fingerprint is the buck($) being in the first 20 characters while all the digits are past the 60th.

        p
Re: What data is code?
by Zaxo (Archbishop) on Nov 22, 2001 at 04:03 UTC

    One possibility is to check character frequency. Your semicolon idea sounds good: if the number of semicolons is comparable to the number of newlines (and "\n" isn't encoded), chances are it's code. If newlines get encoded too, lines will be unusually long.

    The primary single measure of language-like structure is called the Friedman kappa, an index of coincidence. Scan the text, comparing each character with the character a fixed distance ahead. Increment a count if the two are equal, then divide the count by the number of comparisons. Random or well-encrypted text will score about 1/(alphabet size), e.g. about 0.038 for a 26-letter alphabet. The redundancy of natural language leads to an index in the range .05-.08.

    After Compline,
    Zaxo
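
The kappa test described above might look something like this. It is a sketch only: the function name `kappa` and the default shift of 1 are choices made for the example, and the result is returned as a fraction between 0 and 1.

```perl
use strict;
use warnings;

# Fraction of positions where a character equals the character
# $shift places ahead of it.
sub kappa {
    my ($text, $shift) = @_;
    $shift = 1 unless defined $shift;    # assumed default distance
    my $pairs = length($text) - $shift;
    return 0 if $pairs <= 0;

    my $hits = 0;
    for my $i (0 .. $pairs - 1) {
        $hits++ if substr($text, $i, 1) eq substr($text, $i + $shift, 1);
    }
    return $hits / $pairs;
}

my $snippet = "foreach my \$line (\@lines) {\n    print \$line;\n}\n";
printf "shift %d: %.3f\n", $_, kappa($snippet, $_) for 1 .. 3;
```

For raw ciphertext the alphabet is 256 byte values, so the expected score is around 1/256. On source code, a shift of 1 is inflated by runs of indentation, so it is probably worth looking at several shifts rather than a single one.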

Re: What data is code?
by chipmunk (Parson) on Nov 22, 2001 at 03:54 UTC
    I'd go with solution #3, using an unencrypted comment that indicates the code has been encrypted. Yes, the user might remove or change it, but the user could just as easily fiddle with the encrypted form of the code itself. So, I don't think it's worth worrying especially about the integrity of the comment.
Re: What data is code?
by Anonymous Monk on Nov 22, 2001 at 03:52 UTC

    Before filtering, test for a known encrypted value in the code (a known comment that *you* insert when you do filter/encrypt). If the encrypted comment is there, don't encrypt; if it isn't there, prepend the special comment, then encrypt.

      I can't manipulate the encrypting process itself since I use Crypt::CBC. The user (of the module) will provide keyphrase and cipher. I could check the very first string (til it matches a ;) after the use statement with eval tho :)

      Greetz
      Beatnik
      ... Quidquid perl dictum sit, altum viditur.

        It shouldn't matter that the user has control over the code encryption technique; you only need to control the magic. When your filter is called it can encrypt a special string (magic cookie) by any means it wishes. Now, it can test the incoming code to see if that encrypted string is at the beginning: if(index($_,$cookie) == 0). If it is, then the file is encrypted, so strip it off and then decrypt the rest using the user's specified parameters. If it isn't there, encrypt the code via the user's parameters, prepend your encrypted magic cookie, overwrite the source file and exit. You have control over the cookie: you can create it, test for it, add it or strip it as required.
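
A minimal sketch of the cookie bookkeeping, with several assumptions: the cookie here is a plain fixed string rather than an encrypted one, the names `$COOKIE`, `is_encrypted`, `protect` and `reveal` are invented for the example, and the actual cipher is passed in as a code reference so that nothing about the Crypt::CBC interface is assumed. The XOR routine is only a stand-in to make the example runnable.

```perl
use strict;
use warnings;

# Hypothetical marker; any fixed byte string the filter controls would do.
my $COOKIE = "#FILTER-COOKIE-v1#";

# True if the text starts with the cookie, i.e. was already processed.
sub is_encrypted {
    my ($text) = @_;
    return index($text, $COOKIE) == 0 ? 1 : 0;
}

# Encrypt once: text that already carries the cookie is returned unchanged.
sub protect {
    my ($text, $encrypt) = @_;
    return $text if is_encrypted($text);
    return $COOKIE . $encrypt->($text);
}

# Strip the cookie and decrypt; unmarked text is returned unchanged.
sub reveal {
    my ($text, $decrypt) = @_;
    return $text unless is_encrypted($text);
    return $decrypt->(substr($text, length($COOKIE)));
}

# Stand-in cipher for demonstration only (XOR is its own inverse).
my $toy_cipher = sub { my $s = shift; return $s ^ ("\x5A" x length($s)) };

my $source = "print qq{hello\\n};\n";
my $once   = protect($source, $toy_cipher);
my $twice  = protect($once,   $toy_cipher);

print "double encryption avoided\n" if $once eq $twice;
print reveal($twice, $toy_cipher);
```

The one case this does not cover is plain source that happens to begin with the cookie string, which is why a long, unlikely marker is the safer choice.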
