I come from a CS background. I've had algorithmic efficiency beat into me till it was coming out of my ears. After so many classes on data structures and algorithm analysis, I've become wary of built-in, higher-level operators like Perl's push, pop, shift and unshift list manipulation functions. Oh, I've used them plenty, but there has always been this nagging little voice in the back of my head asking, "What if unshift takes longer to run on a larger list?" and "Does it take longer to find the size of a bigger list?"
Questions like these led me on a spiritual quest through the bowels of Perl's source code.... The details of my journey follow.
Now before I dig too deep, I want to introduce the uninitiated to big-O notation. Big-O notation is a method of describing the efficiency of an algorithm as it relates, in a very general way, to the quantity of data the algorithm is operating on. Uri Guttman and Larry Rosler's article A Fresh Look at Efficient Perl Sorting (in addition to being a great article in its own right) has a fantastic overview of big-O notation, much better than what I have cobbled together here. Some common big-O analyses are as follows (in order from best to worst performance-wise).
- O(1) - algorithm takes a constant amount of time to run no matter what the dataset size. A good example of this is a hash table lookup.
- O(log(n)) - algorithm performance is proportional to the logarithm of the dataset size. A binary search is O(log(n)).
- O(n) - algorithm runtime scales linearly with dataset size. If it took 10 seconds with 200 elements, it will take 20 seconds with 400 elements. Scanning a list for an element with a particular value is O(n).
- O(n*log(n)) - the best sorting algorithms (quicksort, mergesort) take this long to run.
- O(n^2) - algorithm performance is proportional to the dataset size squared. Basic sorting algorithms (selection sort, insertion sort, bubble sort) take this long to run.
- O(2^n) - only the slowest algorithms take this long to run; a brute-force solution to the traveling salesman problem is a classic example.
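To make two of these categories concrete, here is a small benchmark (a sketch using the core Benchmark module; absolute numbers will vary by machine) pitting an O(1) hash lookup against an O(n) list scan for the same value:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Build a large dataset two ways: as a hash (O(1) lookup)
# and as a list we must scan element by element (O(n) lookup).
my @list = (1 .. 100_000);
my %hash = map { $_ => 1 } @list;
my $target = 99_999;    # a value near the end: worst case for the scan

cmpthese(-1, {
    'hash lookup' => sub { my $found = exists $hash{$target} },
    'list scan'   => sub {
        my $found = 0;
        for my $e (@list) { $found = 1, last if $e == $target }
    },
});
```

The hash lookup should come out orders of magnitude faster, and the gap only widens as @list grows.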
A straightforward list implementation that every introductory CS student learns uses an array plus an offset to the last in-use array element. This implementation works well for push, pop, determining the number of elements, insertion, fetching, etc. However, the performance of shift and unshift with this implementation is horrible: O(n)! Because there is no room at the beginning of the array, all the elements have to be moved by one to insert or remove an element at the front of the list.
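Here is a toy sketch of that naive implementation (hypothetical package and method names, not how perl actually does it) that makes the O(n) cost of shift explicit: every surviving element has to be copied down one slot.

```perl
use strict;
use warnings;

# A naive list: an array plus an offset to the last in-use slot.
package NaiveList;

sub new { bless { elems => [], last => -1 }, shift }

# push is O(1): just write one past the current end.
sub push_elem {
    my ($self, $val) = @_;
    $self->{elems}[ ++$self->{last} ] = $val;
}

# shift is O(n): every remaining element slides down one slot.
sub shift_elem {
    my ($self) = @_;
    return undef if $self->{last} < 0;
    my $front = $self->{elems}[0];
    $self->{elems}[ $_ - 1 ] = $self->{elems}[$_] for 1 .. $self->{last};
    $self->{last}--;
    return $front;
}

package main;

my $l = NaiveList->new;
$l->push_elem($_) for 1 .. 5;
print $l->shift_elem, "\n";    # prints 1, after n element copies
```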
There are many other ways to implement lists (some better, some worse, others with different performance tradeoffs), but without knowing which implementation Perl used (or at least seeing some notes in perlfunc or some other piece of perl documentation about the performance of these list manipulation functions) I was having trouble sleeping at night.
Thanks to some source-code browsing and the Perl Internals chapter of Advanced Perl Programming I am now happy and content to use all of Perl's built-in list operators without any loss of sleep.
Perl implements lists with an array and first/last element offsets. The array is allocated larger than needed with the offsets originally pointing in the middle of the array so there is room to grow in both directions (unshifts and pushes/inserts) before a re-allocation of the underlying array is necessary. The consequence of this implementation is that all of perl's primitive list operators (insertion, fetching, determining array size, push, pop, shift, unshift, etc.) perform in O(1) time.
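A quick, unscientific way to see this O(1) behavior for yourself (using the core Time::HiRes module) is to time a fixed number of shifts against lists of very different sizes; if shift were O(n), the larger list would take dramatically longer:

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Time 1000 shifts against a small and a large list; since shift
# is O(1), the list's size should make no real difference.
for my $size (1_000, 1_000_000) {
    my @list = (1 .. $size);
    my $t0 = [gettimeofday];
    shift @list for 1 .. 1_000;
    printf "%9d elements: %.6f sec for 1000 shifts\n",
        $size, tv_interval($t0);
}
```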
There are some worst-case scenarios for push, unshift and insertion where performance is O(n). These occur when the array upon which the list is implemented has to be reallocated because the operation would access the array beyond its preallocated size. This reallocation overhead is common to all list implementations that handle the primitive operators in amortized O(1) time.
One consequence of perl's list implementation is that queues implemented using perl lists end up "creeping forward" through the preallocated array space, leading to reallocations even though the queue itself may never contain many elements. In comparison, a stack implemented with a perl list only requires reallocations as the list grows larger. However, perl anticipates the use of lists as queues, and these queue-type reallocations end up having a negligible impact on performance. In benchmarked tests, queue access of a list (using repeated push/shift operations) is nearly as fast as stack access to a list (using repeated push/pop operations).
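A benchmark along these lines might look like the following (a sketch using the core Benchmark module; timings vary by machine and perl build):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Compare queue access (push onto the back, shift off the front)
# with stack access (push onto the back, pop off the back).
cmpthese(-1, {
    queue => sub {
        my @q;
        push @q, $_ for 1 .. 1000;
        shift @q while @q;
    },
    stack => sub {
        my @s;
        push @s, $_ for 1 .. 1000;
        pop @s while @s;
    },
});
```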
The preallocated array is balanced toward the 0 end of the array (meaning there is more free space at the far end of the list than there is before the list's 0 element). This is done purposely to make pushes more efficient than unshifts (since perl programmers use push much more than they use unshift). As a consequence, a given number of unshifts will result in more reallocations than the same number of pushes. My benchmarking of a front-loaded queue (implemented using unshift and pop) shows it to be nearly 20% slower than a back-loaded queue (implemented using push and shift), presumably for this very reason.
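The front-loaded versus back-loaded comparison can be benchmarked the same way (again a sketch with the core Benchmark module; your percentages will vary):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# A back-loaded queue adds with push and removes with shift;
# a front-loaded queue adds with unshift and removes with pop.
# Both preserve FIFO order, but the front-loaded variant grows
# toward the 0 end of the array, where perl preallocates less room.
cmpthese(-1, {
    back_loaded => sub {
        my @q;
        push @q, $_ for 1 .. 1000;
        shift @q while @q;
    },
    front_loaded => sub {
        my @q;
        unshift @q, $_ for 1 .. 1000;
        pop @q while @q;
    },
});
```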