Presumably, this is a continuation of the older thread.
I had an _epi32 mergesort implemented earlier (*), so I repurposed it for this task. No doubt a parallel version would scale rather nicely too, though I haven't tried that. The test image is 7360x4912 pixels of sciurine menace, obtained via dcraw.
$ perf stat perl rgb48.pl squirrl.dat
[1452115604.189627] sort_and_uniq:
[ 0.294368] first binning
[ 0.250713] second binning
[ 0.352289] merge and count
squirrl.dat == 35556508
Performance counter stats for 'perl rgb48.pl squirrl.dat':
937.769123 task-clock # 1.000 CPUs utilized
13 context-switches # 0.014 K/sec
0 cpu-migrations # 0.000 K/sec
55529 page-faults # 0.059 M/sec
2775248733 cycles # 2.959 GHz
1375100650 stalled-cycles-frontend # 49.55% frontend cycles idle
588508119 stalled-cycles-backend # 21.21% backend cycles idle
3397637328 instructions # 1.22 insns per cycle
# 0.40 stalled cycles per insn
310595954 branches # 331.207 M/sec
9170806 branch-misses # 2.95% of all branches
0.937321108 seconds time elapsed
Note that this is a Lynnfield CPU, without AVX. The same test in a timethis loop:
timethis 10: 8 wallclock secs ( 8.33 usr + 0.00 sys = 8.33 CPU) @ 1.20/s (n=10)
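For anyone who just wants the sort-then-count idea without the SIMD part, here is a minimal sketch in plain C. It assumes each 48-bit pixel is packed into one 64-bit key, and it uses the library qsort as a stand-in; it is not the _epi32 mergesort or the binning passes timed above. The names pack_rgb48 and count_unique are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Pack three 16-bit channels into one 64-bit key (top 16 bits stay zero). */
static uint64_t pack_rgb48(uint16_t r, uint16_t g, uint16_t b)
{
    return ((uint64_t)r << 32) | ((uint64_t)g << 16) | (uint64_t)b;
}

/* qsort comparator; compares instead of subtracting to avoid overflow. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sort the keys in place, then count the runs of equal values. */
static size_t count_unique(uint64_t *keys, size_t n)
{
    assert(keys != NULL || n == 0);
    if (n == 0)
        return 0;
    qsort(keys, n, sizeof keys[0], cmp_u64);
    size_t uniq = 1;
    for (size_t i = 1; i < n; i++)
        uniq += (keys[i] != keys[i - 1]);   /* a new run starts here */
    return uniq;
}
```

At 8 bytes per key, a 7360x4912 image needs roughly 290 MB for the key array, so a real implementation would likely want something tighter.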
There are other optimized sort implementations out there. Intel IPP (Integrated Performance Primitives) has the following routines, among myriad others:
IppStatus ippsSortAscend_32s_I(Ipp32s* pSrcDst, int len);
IppStatus ippsSortAscend_64f_I(Ipp64f* pSrcDst, int len);
IppStatus ippsSortRadixAscend_32u_I(Ipp32u* pSrcDst, Ipp32u* pTmp, Ipp32s len);
...
I'd expect these to provide a well-optimized solution for any Intel platform.
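To show what a radix sort with a caller-supplied scratch buffer looks like, here is a rough plain-C sketch of an LSD radix sort on 32-bit unsigned values. It only mirrors the data/scratch/length shape of the radix prototype above; it is not IPP's implementation, makes no claim about how IPP works internally, and the name radix_sort_u32 is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* LSD radix sort: four passes of 8 bits each, least significant byte first.
 * tmp is scratch space and must hold at least len elements. */
static void radix_sort_u32(uint32_t *data, uint32_t *tmp, size_t len)
{
    assert((data != NULL && tmp != NULL) || len == 0);
    uint32_t *src = data, *dst = tmp;

    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[256] = {0};

        /* Histogram of the current byte. */
        for (size_t i = 0; i < len; i++)
            count[(src[i] >> shift) & 0xFF]++;

        /* Exclusive prefix sum turns counts into starting offsets. */
        size_t sum = 0;
        for (int b = 0; b < 256; b++) {
            size_t c = count[b];
            count[b] = sum;
            sum += c;
        }

        /* Stable scatter into the other buffer. */
        for (size_t i = 0; i < len; i++)
            dst[count[(src[i] >> shift) & 0xFF]++] = src[i];

        uint32_t *t = src; src = dst; dst = t;
    }
    /* Four passes means an even number of swaps: the result is in data. */
}
```

Each pass is stable, which is what lets the later (more significant) passes preserve the order established by the earlier ones.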