Nothing, it's a benchmark. It's the nature of benchmarks to consider code in isolation, with as few side effects as possible, so that the differences can be measured. It doesn't measure a whole program, or what effect each solution may have on one.
However, this is also the nature of premature and micro-optimizations. The point of the benchmark was simply to show that the assumption that keys/values is slower than each is not always correct. Any decision beyond that should be based on measurements of actual code, taken with a profiler, and on evidence that this part of the code is the bottleneck.