Or you could standardize the internal representation. A string is a sequence of code points. Storing the code point count (ncodepts) alongside the byte length (nbytes) could be handy when dealing predominantly with string objects. Then the following cases arise:
- (nbytes == 0 && ncodepts == 0) trivial case/empty/false
- (nbytes > 0 && ncodepts == 0) binary blob
- (nbytes > 0 && ncodepts == nbytes) with a UTF-8 internal rep, this means the string is plain ASCII
- (nbytes > 0 && ncodepts < nbytes) generic Unicode string (at least one multi-byte character)
Extended 8-bit charsets (ISO 8859) suffer under a UTF-8 internal representation, since every character above 0x7F grows to at least two bytes, unless you hack the (ncodepts == nbytes) case to indicate native format...
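To make the cases above concrete, here is a rough Perl sketch. The helper names classify() and counts() are made up for illustration, and it assumes the internal rep is UTF-8, so nbytes is simply the length of the UTF-8 encoding. It is not how any real interpreter stores these counts.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: map (nbytes, ncodepts) onto the four cases.
sub classify {
    my ($nbytes, $ncodepts) = @_;
    return 'empty'   if $nbytes == 0 && $ncodepts == 0;
    return 'blob'    if $nbytes >  0 && $ncodepts == 0;
    return 'ascii'   if $nbytes >  0 && $ncodepts == $nbytes;
    return 'unicode' if $nbytes >  0 && $ncodepts <  $nbytes;
    return 'invalid';    # ncodepts > nbytes cannot happen with UTF-8
}

# Hypothetical helper: compute both counts for a character string,
# assuming a UTF-8 internal representation.
sub counts {
    my ($str) = @_;
    my $nbytes   = length(encode('UTF-8', $str));   # bytes after encoding
    my $ncodepts = length($str);                    # code points
    return ($nbytes, $ncodepts);
}

print classify(counts("")), "\n";              # empty
print classify(counts("hello")), "\n";         # ascii
print classify(counts("caf\x{e9}")), "\n";     # unicode (5 bytes, 4 code points)
print classify(16, 0), "\n";                   # blob
```

Note that "caf\x{e9}" fits in 4 bytes as ISO 8859-1 but needs 5 as UTF-8, which is the cost the 8-bit charsets pay.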
More interesting is the interaction between objects. Consider a blob and a string object:
$foo = ($str . $obj);
$bar = ($obj . $str);
$baz = "${obj}${str}";
When is the blob promoted to a string, and when does the opposite happen? Object representation and efficiency are certainly big concerns, but surely the semantic implications of Unicode are far more insidious.
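One way to poke at the promotion question in today's Perl is to overload concatenation on a blob class and pick a policy explicitly. The sketch below uses a hypothetical Blob class and assumes one arbitrary policy: the blob always wins, and the string is demoted to its UTF-8 bytes. The opposite policy (decode the blob and return a string) is just as defensible, which is rather the point.

```perl
use strict;
use warnings;

package Blob;    # hypothetical class, for illustration only
use Encode qw(encode);
use Scalar::Util qw(blessed);
use overload
    '.'  => \&concat,
    '""' => sub { $_[0]{bytes} };

sub new {
    my ($class, $bytes) = @_;
    return bless { bytes => $bytes }, $class;
}

# Assumed policy: the result is always a Blob. A plain string operand is
# encoded to UTF-8 bytes first. $swapped is true when the Blob was the
# right-hand operand, so operand order is still visible here.
sub concat {
    my ($self, $other, $swapped) = @_;
    my $raw = (blessed($other) && $other->isa('Blob'))
        ? $other->{bytes}
        : encode('UTF-8', "$other");
    return Blob->new($swapped ? $raw . $self->{bytes}
                              : $self->{bytes} . $raw);
}

package main;

my $obj = Blob->new("\x00\xff");    # 2 raw bytes
my $str = "caf\x{e9}";              # 4 code points, 5 bytes as UTF-8

my $foo = ($str . $obj);
my $bar = ($obj . $str);

print ref($foo), " ", length("$foo"), "\n";    # Blob 7
print ref($bar), " ", length("$bar"), "\n";    # Blob 7
```

As far as I know, Perl 5 compiles string interpolation down to concatenation, so "${obj}${str}" should go through the same handler, but I would check that before relying on it.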