PerlMonks  

How can I tell if a string contains binary data or plain-old text?

by Anonymous Monk
on Oct 31, 2003 at 00:37 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

How can I, reasonably, tell if a string contains text (ISO Latin-1, but preferably Unicode) or arbitrary binary data (like a JPEG image)?

Re: How can I tell if a string contains binary data or plain-old text?
by graff (Chancellor) on Oct 31, 2003 at 04:06 UTC
    There is no single, simple answer to this question. In one sense, "plain-old text" is arbitrary binary data unless you happen to know which human language the text is written in and are reasonably sure that it represents correct usage in that language: few or no typos, few words quoted or borrowed from some other language, no line noise or other sort of corruption, etc. If the text is in a language that uses characters beyond 7-bit ASCII, the distinction between "text" and "not text" can be slippery.

    One general approach is to develop a statistical model of what you consider to be "text". Text data in any human language will have a fairly distinctive distribution of byte values when compared to any non-linguistic data stream (including text that has been compressed, encrypted, and/or encoded via base64, uuencode, etc.) -- or when compared to some other language, or to data in the same language under a different character encoding (e.g. CP437 vs. Latin-1 vs. UTF-16).

    That is, the relative probabilities of the 256 different byte values will be quite distinctive for a given language, using a given character encoding. Of course, there are limitations: classification is less reliable on short strings (but any test case of more than 60 bytes should be pretty robust); you need to have enough valid text data to build a decent model; and if you need to recognize "plain text" in different languages, or using different character encodings, you need a separate model for each type of "target" you want to recognize. It also helps if you can build a relevant model of the "non-text" data you are likely to encounter. (If your model is based on bigrams -- i.e. the probabilities of byte pairs -- it can be much more powerful and accurate, but then you have 64K probabilities to keep track of, instead of 256.)
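
    A minimal sketch of the single-byte version of such a model, assuming add-one smoothing and an average log-probability score (neither is specified above; the sub names and the training sample are made up for illustration):

```perl
use strict;
use warnings;

# Build a table of log-probabilities for the 256 byte values from a
# sample of known-good text. Add-one smoothing (an assumption) keeps
# unseen byte values from getting probability zero.
sub build_byte_model {
    my ($sample) = @_;
    my @count = (1) x 256;
    $count[$_]++ for unpack 'C*', $sample;
    my $total = length($sample) + 256;
    return [ map { log($_ / $total) } @count ];
}

# Average log-probability per byte of $bytes under $model.
# Higher (closer to zero) means "more like the training text".
sub avg_log_prob {
    my ($model, $bytes) = @_;
    return 0 unless length $bytes;
    my $sum = 0;
    $sum += $model->[$_] for unpack 'C*', $bytes;
    return $sum / length($bytes);
}

# Toy training data; a real model needs far more text than this.
my $model  = build_byte_model("The quick brown fox jumps over the lazy dog. " x 50);
my $text   = "the lazy fox jumps over the dog";
my $binary = pack 'C*', map { ($_ * 37) % 256 } 0 .. 63;   # 64 spread-out byte values

printf "text:   %.2f\n", avg_log_prob($model, $text);
printf "binary: %.2f\n", avg_log_prob($model, $binary);
```

    Where to put the cutoff between "text" and "not text" is something you would have to tune on your own data, ideally against a second model built from the non-text you expect to see.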

    Maybe this is not the sort of answer you were looking for? In any case, statistical classification methods are expected to be wrong some percentage of the time (both false positives and false negatives), and the vagaries of "text data" can often pose difficult boundary cases, like strings that contain some text, and some stuff that isn't text (e.g. the kind of crap you find in M$ Word "doc" files).

Re: How can I tell if a string contains binary data or plain-old text?
by davido (Cardinal) on Oct 31, 2003 at 01:13 UTC
    You can't do it yourself easily, though there are tricks. If you know that it's either Unicode OR a JPEG, you can look for the JPEG header, and rule JPEG out if the header isn't found. Or if you're limiting the text to standard ASCII, you can probably be pretty certain it's text if each byte's value is 127 or less. But that gets blown away if your text is 8-bit MIME or Unicode, or if you're looking at a UUEncoded file, which is a non-text entity encoded into 7-bit text-only characters for the purpose of easy SMTP transportability. A zipped or tarred file might look like binary data on the surface, but could contain a text file within. A UUEncoded file will look like text on the outside but may contain binary data within. Just like a JPEG looks like binary data on the outside and yet represents an image within.

    The problem is that the more variants of "plain old text" you consider to be plain old text, the more difficult it becomes to distinguish it from non-text.

    That being the case, you can guess based on various criteria.
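
    Two of those criteria, roughly sketched (the sub names are hypothetical; the only outside fact assumed is that JPEG data starts with the bytes FF D8 FF):

```perl
use strict;
use warnings;

# True if the data starts with the JPEG start-of-image marker
# followed by the first byte of the next marker (FF D8 FF).
sub has_jpeg_header {
    my ($bytes) = @_;
    return substr($bytes, 0, 3) eq "\xFF\xD8\xFF" ? 1 : 0;
}

# True if every byte is 7-bit (0..127). An empty string passes.
sub is_7bit {
    my ($bytes) = @_;
    return $bytes !~ /[^\x00-\x7F]/ ? 1 : 0;
}

print has_jpeg_header("\xFF\xD8\xFF\xE0...") ? "jpeg\n" : "not jpeg\n";
print is_7bit("Hello, world\n") ? "7-bit\n" : "has high bytes\n";
```

    Note that is_7bit() says yes to a uuencoded file and no to Latin-1 or UTF-8 text, which is exactly the weakness described above.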


    Dave


    "If I had my life to live over again, I'd be a plumber." -- Albert Einstein
      Slightly better than excluding characters over 127 is excluding characters from 1 to 31 inclusive, since those aren't used for printable characters in any single-byte 8-bit encoding. They also aren't used as the first bytes in the variable-length encodings, although this requires parsing the symbols to figure out which are the first bytes.
      Of course a few control characters will occur legitimately in text strings (e.g. EOF), but the percentage will be tiny compared to the ~12.5% you expect in most binaries.
        There's no EOF character in the ASCII set. There might be some filesystems that require files to use a particular character to signal the end of a file (for instance, the SUB (aka ^Z) character has been used), but most modern filesystems record the size of the file as metadata (kept in inodes on Unix-style filesystems) and don't need a certain character to be present.

        However, some characters in the range 00-1F are found in text files: carriage returns (^M), line feeds (^J), tabs (^I), bells (^G), form feeds (^L) and backspaces (^H). Theoretically, one could find vertical tabs (^K) in text files as well, but I've never knowingly encountered such a thing in a text file.
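
        A rough sketch of a check built on that list (the sub name is made up, and treating NUL, ESC and the other remaining codes in 00-1F as suspicious is an assumption):

```perl
use strict;
use warnings;

# Fraction of bytes that are control characters in 00-1F other than
# the ones listed above as plausible in text:
# BEL (07), BS (08), TAB (09), LF (0A), VT (0B), FF (0C), CR (0D).
sub control_ratio {
    my ($bytes) = @_;
    return 0 unless length $bytes;
    my $odd = () = $bytes =~ /[\x00-\x06\x0E-\x1F]/g;   # count matches
    return $odd / length($bytes);
}

printf "%.3f\n", control_ratio("line one\r\n\tline two\n");
printf "%.3f\n", control_ratio("\x00\x01\x02ab");
```

        What ratio counts as "too many" is a judgment call; anything near the ~12.5% mentioned earlier is a strong hint that the data is not text.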

        Abigail

Re: How can I tell if a string contains binary data or plain-old text?
by zentara (Archbishop) on Oct 31, 2003 at 01:32 UTC
    File::MMagic - Guess file type

    Or run the file through the system command "file".

Re: How can I tell if a string contains binary data or plain-old text?
by dakkar (Hermit) on Oct 31, 2003 at 14:43 UTC

    First of all: you can't have a "Unicode" file.

    You can have a file containing Unicode code-points encoded in one of the transformation formats defined by the Unicode standard, such as UTF-8 or UTF-16.

    So the question becomes:

    I have a byte-stream. Is it a valid (ISO-8859-1|UTF-8|UTF-16)-encoded representation of some text?

    This can be answered, since none of those encodings defines a meaning for each and every byte-sequence. But this is quite possibly not the answer you're looking for.

    The way I see it, it's easier to check if your byte-stream contains something you know not to be text, using something like file(1) or File::MMagic as already suggested.

    Doing it the other way ("is it a valid encoded form") gives you a lot of "this is text" when, in fact, it is nothing intelligible.
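
    For what it's worth, the "is it a valid encoded form" check for UTF-8 could be sketched like this with the core Encode module (the sub name is hypothetical):

```perl
use strict;
use warnings;
use Encode ();

# True if $bytes decodes as strict UTF-8 without errors.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;   # decode() may modify its argument when CHECK is set
    return eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

print is_valid_utf8("caf\xC3\xA9 au lait") ? "valid\n" : "invalid\n";   # UTF-8 bytes
print is_valid_utf8("caf\xE9 au lait")     ? "valid\n" : "invalid\n";   # Latin-1 bytes
print is_valid_utf8("qzx#7!~~")            ? "valid\n" : "invalid\n";   # ASCII gibberish
```

    The last call shows the problem: gibberish made of ASCII bytes is perfectly valid UTF-8, so a yes from this check only rules things out, it doesn't tell you the content is readable.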

    You could try to decode it and then do some heuristics to see if it looks like text (e.g. a lot of letters from the same script/writing system in a row, or something of the sort), but I think it's more trouble than it's worth.

    -- 
            dakkar - Mobilis in mobile
    

    Most of my code is tested...

    Perl is strongly typed, it just has very few types (Dan)

Re: How can I tell if a string contains binary data or plain-old text?
by ambrus (Abbot) on Oct 31, 2003 at 11:19 UTC
    Why does this one not work (on perl 5.8.0)? It should, I think, but -T always gives undef.

    open A,"<",\$a or die 1;print +(-T(A) ? "text" : "bin"), $/;close A;'

      Odd, really. It works with a regular file, but not with an in-memory file... Bug?

      $ perl -e '$a="abc"x500;open A,"<",\$a;$x=-T(A); print +(defined($x)?$x?"text":"bin":"undef"),$/;close A'
      undef
      $ perl -e '$a="abc";open A,"<","/tmp/index.html";$x=-T(A); print +(defined($x)?$x?"text":"bin":"undef"),$/;close A'
      text
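
      Since -T does work on a regular file, one possible workaround (just a sketch; the sub name is made up, and it costs a disk write per string) is to dump the string into a temporary file with the core File::Temp module and test that:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write $bytes to a temporary file and apply -T to the file name.
# Returns 1 if perl's own heuristic calls it text, 0 otherwise.
sub looks_like_text_via_T {
    my ($bytes) = @_;
    my ($fh, $filename) = tempfile(UNLINK => 1);   # removed at program exit
    binmode $fh;
    print {$fh} $bytes;
    close $fh or die "close failed: $!";
    return (-T $filename) ? 1 : 0;
}

print looks_like_text_via_T("abc" x 500)         ? "text\n" : "bin\n";
print looks_like_text_via_T("\x00\x01\x02" x 50) ? "text\n" : "bin\n";
```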
      -- 
              dakkar - Mobilis in mobile
      

