I've been thinking about grabbing the 1st "bunch" of characters, and determining if they're within the printable range, however this method is not foolproof.

heh... Printable in what language(s), using what character encoding(s)?

Does uuencoded or base64 encoded data count as "printable", or as "binary"?
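To make the base64 question concrete, here is a minimal sketch of the "first bunch of characters" check. It assumes "printable" means ASCII printable characters plus common whitespace; `looks_printable` and `sample_size` are names made up for this illustration, not anything from the original post.

```python
import base64
import string

# Assumption: "printable" = ASCII letters, digits, punctuation and whitespace.
PRINTABLE = set(string.printable.encode("ascii"))

def looks_printable(data: bytes, sample_size: int = 64) -> bool:
    """Return True if every byte in the first sample_size bytes is printable."""
    sample = data[:sample_size]
    return all(b in PRINTABLE for b in sample)

raw = bytes(range(256))          # clearly binary: every possible byte value
encoded = base64.b64encode(raw)  # the same payload, base64 encoded

print(looks_printable(raw))      # False
print(looks_printable(encoded))  # True
```

The encoded payload passes the check even though it carries nothing readable, which is why the printable-range test alone can't settle the question.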

I'll confess to being clueless about the details of TCP, and ask a presumably stupid question: how would the re-assembly of packets need to differ, based on whether or not you consider the content to be "binary"?

Doing a more extensive analysis using diphthongs/whitespace/vowels etc. I think would be too slow.

If you're talking about deciding whether or not the content would qualify as "human-readable text", again, I'm hindered by ignorance of TCP (what's the typical packet size?) -- and I'd have to repeat the earlier questions (which human language(s)? which character encoding(s)?) -- but modeling readable text, in terms of the relative probabilities of occurrence for individual characters, would not be very hard, and could be quite robust with test strings of as few as 32 characters (the more, the better, of course).

Essentially, you "train" one model on some suitably large set of known human-readable text (just 10K words would probably do); the model consists of the probability of occurrence for each printable character. Then train another model on a (preferably larger) set of data known to contain little or no readable text (or maybe just assume equal probabilities for all printable byte values).
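A rough sketch of what "training" could look like, assuming ASCII printable characters only. The function names and the tiny repeated sentence standing in for a real 10K-word corpus are illustrative assumptions.

```python
from collections import Counter
import string

# Assumption: the model covers ASCII printable characters only.
ALPHABET = string.printable

def train_unigram_model(text: str) -> dict:
    """Estimate the probability of each printable character from training text."""
    counts = Counter(ch for ch in text if ch in ALPHABET)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("training text contains no printable characters")
    # Characters never seen in training get probability 0.
    return {ch: counts[ch] / total for ch in ALPHABET}

def uniform_model() -> dict:
    """The 'assume equal probabilities' option for the non-text model."""
    p = 1.0 / len(ALPHABET)
    return {ch: p for ch in ALPHABET}

# Stand-in corpus; a real one would be far larger and more varied.
text_model = train_unigram_model("the quick brown fox jumps over the lazy dog " * 50)
binary_model = uniform_model()
```

Each model is just a table of per-character probabilities, so building one is a single pass over the training data.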

For a given stream of input data to be classified, if it contains non-printables, it's probably not text and you're probably done; but if it contains only printable characters (e.g. could be base64 encoded), compute the relative proportions of occurrence over the set of printable characters, and measure the error between these proportions and each of the two models. If the error relative to the human-readable model is significantly lower, the input is human readable. (Unless of course it's spam, which is often tailored to match the unigram character statistics of a language, without regard to readability...)
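The classification step might be sketched like this. The error measure (sum of squared differences) and the `margin` factor standing in for "significantly lower" are my assumptions, since the description above doesn't pin either down; the corpus is a toy stand-in as well.

```python
from collections import Counter
import base64
import string

ALPHABET = string.printable  # assumption: ASCII printable characters only

def proportions(text: str) -> dict:
    """Relative frequency of each printable character in text."""
    counts = Counter(text)
    return {ch: counts[ch] / len(text) for ch in ALPHABET}

def squared_error(observed: dict, model: dict) -> float:
    """Assumed error measure: sum of squared differences per character."""
    return sum((observed[ch] - model[ch]) ** 2 for ch in ALPHABET)

def classify(data: bytes, text_model: dict, other_model: dict,
             margin: float = 2.0) -> str:
    # Step 1: any non-printable byte means "probably not text", and we're done.
    try:
        text = data.decode("ascii")
    except UnicodeDecodeError:
        return "binary"
    if not text or any(ch not in ALPHABET for ch in text):
        return "binary"
    # Step 2: compare the observed proportions against both models.
    observed = proportions(text)
    err_text = squared_error(observed, text_model)
    err_other = squared_error(observed, other_model)
    # "Significantly lower" is approximated by a hypothetical margin factor.
    return "text" if err_text * margin < err_other else "not text"

corpus = ("it was a bright cold day in the spring and the clocks were "
          "striking while the wind blew dust along the empty street ") * 20
text_model = proportions(corpus)
other_model = {ch: 1.0 / len(ALPHABET) for ch in ALPHABET}  # uniform model

print(classify(b"please send the report before the meeting", text_model, other_model))
print(classify(base64.b64encode(bytes(range(256))), text_model, other_model))
print(classify(b"\x00\x01\x02\xff", text_model, other_model))
```

The work per input is one counting pass plus a fixed-size comparison, so the cost grows linearly with the number of bytes examined.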

If you are worried about speed, though, you'd be better off doing it in C rather than Perl.

(update: In case the question about character encoding didn't make this clear: the modeling of human-readable text would need to be limited to a training set that was homogeneous, at least with respect to character encoding. If your "human" training data includes a mix of UTF-16, UTF-8, GB2312, Big5, Shift JIS, etc., it's going to end up not that different from the "binary" model. And if we're talking about any flavor of Unicode, you also need to limit yourself to a given language (or group of closely related languages) -- for one thing, the definition of 'what is printable' varies widely...)
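A small illustration of why the encoding matters, assuming plain ASCII-range sample text: the same string gives very different byte statistics depending on how it is encoded. `byte_proportions` is a made-up helper for this example.

```python
from collections import Counter

def byte_proportions(data: bytes) -> dict:
    """Relative frequency of each byte value that occurs in data."""
    counts = Counter(data)
    return {b: n / len(data) for b, n in counts.items()}

sample = "plain readable text"
utf8 = sample.encode("utf-8")
utf16 = sample.encode("utf-16-le")

# Share of NUL bytes in each encoding of the same text.
print(byte_proportions(utf8).get(0, 0.0))   # 0.0
print(byte_proportions(utf16).get(0, 0.0))  # 0.5
```

Half of the UTF-16 bytes are NULs here, so a model trained on UTF-8 text would treat the same content in UTF-16 as non-text.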


In reply to Re: Sniffing binary data, heuristics? by graff
in thread Sniffing binary data, heuristics? by Ryszard
