A recent conversation in the chatterbox gave me the idea to write this. A beginning programmer was trying to encode some information with pack and unpack but was having trouble coming to grips with exactly how they work. I have never had trouble with them but I came to programming from a hardware background and I'm very familiar with assembly and C programming. People who have come to programming recently have probably never dealt with things at such a low level and may not understand how a computer stores data. A little understanding at this level might make pack and unpack a little easier to figure out.
Why we need pack and unpack
Perl can handle strings, integers and floating point values. Occassionally a perl programmer will need to exchange data with programs written in other languages. These other languages have a much larger set of datatypes. They have integer values of different sizes. They may only be capable of dealing with fixed length strings (dare I say COBOL?). Sometimes, there may be a need to exchange binary data over a network with other machines. These machines may have different word sizes or even store values differently. Somehow, we need to get our data into a format that these other programs and machines can understand. We also need to be able to interpret the responses we get back.
Perl's pack and unpack functions allow us to read and write buffers of data according to a template string. The template string allows us to indicate specific byte orderings and word sizes or use the local system's default sizes and ordering. This gives us a great deal of flexibility when dealing with external programs.
In order to understand how all of this works, it helps to understand how computers store different types of information.
Computer memory can be looked at as a large array of bytes. A byte contains eight bits and can represent unsigned values between 0 and 255 or signed values between -128 and 127. You can't do a whole lot of computation with such a small range of values so a modern computer's registers are larger than a byte. Most modern processors use 32 bit registers and there are some processors with 64 bit registers. A 32 bit register can store unsigned values between 0 and 4294967295 or signed values between -2147483648 and 2147483647.
When storing values greater than 8 bits long to memory, the value is broken up into 8 bit segments and stored in multiple consecutive storage locations. Some processors will store the segment containing the most significant bits in the first memory location and work up in memory with lesser segments. This is referred to as "big-endian" format. Other processors will store the least significant segment in the first byte and store more significant segments into higher memory locations. This is referred to as "little-endian" format.
This might be easier to see with a picture. Suppose a register contains the value 0x12345678 and we're trying to store it to memory at address 1000. Here's how it looks.
If you have looked at perldoc -f pack or have looked up the pack function in Programming Perl, you have seen a table listing template characters with a description of the type of datum they match. That table lists integer formats of several sizes and byte orders. There are also signed and unsigned versions.
The s, l, and q formats pack 16, 32, and 64 bit values in the host machine's native order. The i format packs a value of the host machine's word length. The n and v formats allow you to specify the size and storage order and are useful for interchange with other systems.
Strings are stored as arrays of characters. Traditionally, each character was encoded in a single byte using some coding system like ASCII or EBCDIC. Newer encoding systems like Unicode use either multi-byte or variable length encodings to represent characters.
Perl's pack function accepts the following template characters for strings.
Strings are stored in successive increasing memory locations with the first character in the lowest address location.
Perl's pack function
The pack function accepts a template string and a list of values. It returns a scalar containing the list of values stored according to the formats specified in the template. This allows us to write data in a format that would be readable by a program written in C or another language or to pass data to a remote system through a network socket.
The template contains a series of letters from the tables above. Each letter is optionally followed by a repeat count (for numeric values) or a length (for strings). A '*' on an integer format tells pack to use this format for the rest of the values. A '*' on a string format tells pack to use the length of the string.
Now, let's try an example. Suppose we're collecting some information from a web form and posting it for processing by our backend system which is written in C. The form allows a monk to request office supplies. The backend system wants to see input in the following format.
After looking through our system header files, we determine that time_t is a long. To create a suitable record for sending to the backend, we could use the following.
That template says 'a long, an int, a 32 character null terminated string and two shorts'.
If monk number 217641 (hey! that's me!) placed an urgent order for two boxes of paperclips on January 1, 2003 at 1pm EST, $rec would contain the following (first line in decimal, second in hex, third as characters where applicable). Pipe characters indicate field boundaries.
Let's figure out where all of that came from. The first template item is a 'l' which packs a long. A long is 32 bits or four bytes. The value that was stored came from the time function. The actual value was 1041444000 or 0x3e132ca0. See how that fits into the beginning of the buffer? My system has an Intel Pentium processor which is little endian.
The second template item is a 'i'. This calls for an integer of the machine's native size. The Pentium is a 32 bit processor so again we pack into four bytes. The monk's number is 217641 or 0x00035229.
The third template item is 'Z32'. This specifies a 32 character null terminated field. You can see the string 'boxes of paperclips' stored next in the buffer followed by zeros (null characters) until the 32 bytes have been filled.
The last template item is 's2'. This calls for two shorts which are 16 bit integers. This consumes two values from the list of values passed to pack. 16 bits get stored in two bytes. The first value was the quantity 2 and the second was the 1 indicating urgent. These two values occupy the last four bytes of the buffer.
Perl's unpack function
Unbeknownst to us when we wrote the web side of this application, someone was porting the backend from C to perl (something about eating dog food, I don't think I heard it right). But, since we've already written the web side of the application, they figured they would just use the same data format. Therefore, they need to use unpack to read the data we sent them.
unpack is kind of the opposite of pack. pack takes a template string and a list of values and returns a scalar. unpack takes a template string and a scalar and returns a list of values.
Theoretically, if we give unpack the same template string and the scalar produced by pack, we should get back the list of values we passed to pack. I say theoretically because if the unpacking is done on a machine with a different byte order (big vs. little endian) or a different word size (16, 32, 64 bit), unpack might interpret the data differently than pack wrote it. The formats we used all used our machine's native byte order and 'i' could be different sizes on different machines so we could be in trouble. But in our simple case, we'll assume the backend runs on the same machine as the web interface.
To unpack the data we wrote, the backend program would use a statement like this.
Notice that the template string is identical to the one we used above for packing and the same information is returned in the same order (except they used $ignore where we packed with $urgent, what are they trying to say?).