One link: bytes

Or, a little bit explained: there is a pragma / module named bytes that allows you to force Perl to use byte semantics for everything. It can be used in two ways:

use bytes; in a certain scope will force byte semantics for that scope. Similarly, no bytes; will disable byte semantics for a scope.
Use the functions implemented in bytes instead of the CORE functions, i.e. bytes::length() instead of length. Make sure not to accidentally enable byte semantics for your file by NOT importing anything from bytes (i.e. write use bytes (); or require bytes; instead of use bytes;).

The functions in bytes are actually the CORE functions, called in wrapper functions with enforced byte semantics.
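A minimal sketch of both ways, for debugging only. The sub names and the sample string are made up for illustration; the only calls taken from the bytes module are use bytes and bytes::length.

```perl
use strict;
use warnings;
use bytes ();   # load the module, but do NOT enable byte semantics for this file

# Second way: call the wrapper function explicitly.
sub char_and_byte_length {
    my ($s) = @_;
    return (length($s), bytes::length($s));
}

# First way: "use bytes" is lexically scoped, so it affects only this sub body.
sub length_under_use_bytes {
    my ($s) = @_;
    use bytes;
    return length($s);
}

# U+263A does not fit into a single byte, so perl has to store it as several bytes.
my $str = "smile \x{263A}";

my ($chars, $bytes) = char_and_byte_length($str);
print "characters: $chars, bytes in memory: $bytes\n";   # characters: 7, bytes in memory: 9
print "length under use bytes: ", length_under_use_bytes($str), "\n";   # 9
```

The byte count shown here is the size of perl's current internal representation, not the size in any encoding you chose, which is exactly why the documentation discourages relying on it.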

Note that the documentation of bytes in Perl 5.12 warns against using it for anything but debugging:

This pragma reflects early attempts to incorporate Unicode into perl and has since been superseded. It breaks encapsulation (i.e. it exposes the innards of how the perl executable currently happens to store a string), and use of this module for anything other than debugging purposes is strongly discouraged. If you feel that the functions here within might be useful for your application, this possibly indicates a mismatch between your mental model of Perl Unicode and the current reality. In that case, you may wish to read some of the perl Unicode documentation: perluniintro, perlunitut, perlunifaq and perlunicode.

I think you have exactly that mismatch problem here. All data you receive from outside your script comes as a stream of bytes. As long as you do not decode those bytes (either manually or inside a library or by using a PerlIO layer), but instead just stuff them unmodified into a string, perl will not treat those bytes in a different way than it did before Unicode. Perl treats each byte as a single character, and length() will return the number of characters, which is equal to the number of bytes. When you decode those bytes, e.g. from UTF-8 or UTF-16, into Perl's internal character representation, length() will still return the number of characters. But due to the decoding, it may be different from the number of bytes that were used to store the encoded string outside Perl.
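To make that concrete, here is a small sketch using Encode::decode. The sample bytes are made up and stand in for data read from a socket or file; the sub name is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Returns (length of the undecoded byte string, length after decoding).
sub lengths_before_and_after_decoding {
    my ($octets, $encoding) = @_;
    my $byte_count = length($octets);             # one "character" per byte
    my $chars      = decode($encoding, $octets);  # now real characters
    return ($byte_count, length($chars));
}

# The UTF-8 encoding of "Käse", as it would arrive from outside: 5 bytes.
my $raw = "K\xC3\xA4se";

my ($n_bytes, $n_chars) = lengths_before_and_after_decoding($raw, 'UTF-8');
print "undecoded: $n_bytes, decoded: $n_chars\n";   # undecoded: 5, decoded: 4
```

In both cases length() counts characters; the only thing that changed is what the string contains.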

Behind the scenes, Perl has two different ways to store strings. The ill-named UTF8 flag switches between the two ways. In "classic mode", the UTF8 flag is off, each byte represents a single character, like in ancient perls. In "Unicode mode", the UTF8 flag is on, a character may spread over several bytes. As far as I know, the string is currently stored in some kind of "relaxed" or "extended" UTF-8 encoding, hence the name of the flag. But it does not and should not matter. You should not be interested in the way perl stores characters in memory. The next release could start storing characters encoded as UTF-32 or a hypothetical UTF-64 and you should see absolutely no difference from inside perl. Unless, of course, you start flipping the UTF8 bit without changing the actual in-memory encoding. See Encode.
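A sketch to illustrate that the flag is only a storage detail: utf8::upgrade and utf8::downgrade change the internal representation, not the characters. The sub name and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Store the same characters both ways and compare what perl code can observe.
sub compare_storage {
    my ($s) = @_;
    my $classic = $s;
    utf8::downgrade($classic);   # "classic mode": one byte per character
    my $unicode = $s;
    utf8::upgrade($unicode);     # "Unicode mode": UTF8 flag on
    return {
        classic_flag => utf8::is_utf8($classic) ? 1 : 0,
        unicode_flag => utf8::is_utf8($unicode) ? 1 : 0,
        classic_len  => length($classic),
        unicode_len  => length($unicode),
        same_string  => ($classic eq $unicode) ? 1 : 0,
    };
}

# 4 characters, all below 256, so both storage formats are possible.
my $r = compare_storage("caf\xE9");
print "flags: $r->{classic_flag} / $r->{unicode_flag}\n";       # flags: 0 / 1
print "lengths: $r->{classic_len} / $r->{unicode_len}\n";       # lengths: 4 / 4
print "equal: $r->{same_string}\n";                             # equal: 1
```

Note that utf8::downgrade dies for strings containing characters above 255, so this comparison only works for strings like the one above.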

If you want to know how many bytes a string occupies in a certain encoding, you should use the Encode module to convert that string into a byte stream with that encoding, and get its length.
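Roughly like this, as a sketch. The sub name encoded_length and the sample text are made up; Encode::encode is the only library call.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Number of bytes the character string $chars occupies in $encoding.
sub encoded_length {
    my ($chars, $encoding) = @_;
    return length(encode($encoding, $chars));
}

my $text = "\x{20AC}100";   # a euro sign followed by "100": 4 characters

print "characters: ", length($text), "\n";                  # 4
print "UTF-8:      ", encoded_length($text, 'UTF-8'), "\n";     # 6
print "UTF-16LE:   ", encoded_length($text, 'UTF-16LE'), "\n";  # 8
print "UTF-32LE:   ", encoded_length($text, 'UTF-32LE'), "\n";  # 16
```

The same four characters need a different number of bytes in each encoding, which is why "the size in bytes" only makes sense once you have named the encoding.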

For the special case of HTTP::Request / HTTP::Response, both inherit from HTTP::Message, which treats the content as a string of bytes. So length($msg->content()) will always(*) return the number of bytes. HTTP::Message also has a decoded_content() method that returns a string of characters, which may or may not have the UTF8 flag set. length($msg->decoded_content(...)) will always return the number of characters, given decodable content. If the content cannot be decoded, decoded_content() returns undef, so check for that. (The decodable() function does not test a message; it lists the content encodings that decoded_content() can handle, e.g. for building an Accept-Encoding header.)

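(*) "always" needs a caveat: $msg->decode() replaces the content with a version that has the Content-Encoding (e.g. gzip) undone and removes that header. After that, length($msg->content()) returns the number of bytes of the unpacked content, not the number of bytes that came over the wire. According to the HTTP::Message documentation the content is still a string of bytes at that point, so you still need decoded_content() to get characters. $msg->encode($encoding) goes the other way and applies a content encoding.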
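A small sketch of the difference between content() and decoded_content(), assuming the HTTP-Message distribution from CPAN is installed. The response is built by hand here instead of being fetched, and the sub name content_lengths is made up.

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTTP::Response;   # part of the HTTP-Message distribution on CPAN

# Returns (number of bytes in the body, number of characters after decoding).
# The second value is undef if decoded_content() could not decode the body.
sub content_lengths {
    my ($res) = @_;
    my $decoded = $res->decoded_content;
    return (length($res->content), defined $decoded ? length($decoded) : undef);
}

my $body = "Gr\x{FC}\x{DF}e";   # "Grüße": 5 characters

my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/plain; charset=UTF-8' ],
    encode('UTF-8', $body),     # what would travel over the wire: 7 bytes
);

my ($n_bytes, $n_chars) = content_lengths($res);
print "bytes: $n_bytes, characters: $n_chars\n";   # bytes: 7, characters: 5
```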

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

In reply to Re^5: Size and anatomy of an HTTP response by afoken
in thread Size and anatomy of an HTTP response by Discipulus
