http://qs321.pair.com?node_id=68950

Lexicon has asked for the wisdom of the Perl Monks concerning the following question:

I've spent the last few weeks optimizing, reoptimizing, and re(n)optimizing a math module (Math::Combinator) I've been working on (to be released here soon), and had to wonder exactly how fast certain things were, both in absolute terms and compared to each other. These things are the fundamental components of most programming languages: arrays, hashes, subroutine calls (by reference), and subroutine calls (by value). So, duh, Benchmark.
    use Benchmark;

    $a = shift;
    $b = shift;
    print "A = $a\tB = $b\n";

    $hash{$_}  = $_ for (1..$b);
    $array[$_] = $_ for (1..$b);   # was @array[$_], a one-element slice; $array[$_] is the element itself

    sub doSubR { ${$_[0]} = ${$_[0]} }
    sub doSubV { return $_[0] }

    timethese( $a, {
        '1 Nothing  ' => 'for (1..$b) { $trash = $_ };',
        '2 Mult     ' => 'for (1..$b) { $trash = $_ * $_ };',
        '3 Array    ' => 'for (1..$b) { $trash = $array[$_] };',
        '4 Hash     ' => 'for (1..$b) { $trash = $hash{"$_"} };',
        '5 Sub (Ref)' => 'for (1..$b) { doSubR(\$_); $trash = $_ };',
        '6 Sub (Val)' => 'for (1..$b) { $trash = doSubV($_) };',
    });
For A = 10 and B = 1,000,000 I get these results, which I find fairly satisfying for accuracy:
    [lexicon]$ perl bench.pl 10 1000000
    A = 10  B = 1000000
    Benchmark: timing 10 iterations of 1 Nothing, 2 Mult, 3 Array, 4 Hash, 5 Sub (Ref), 6 Sub (Val)...
    1 Nothing  :  35 wallclock secs ( 29.99 usr +  0.10 sys =  30.09 CPU)
    2 Mult     :  58 wallclock secs ( 48.79 usr +  0.23 sys =  49.02 CPU)
    3 Array    :  63 wallclock secs ( 54.49 usr +  0.20 sys =  54.69 CPU)
    4 Hash     : 184 wallclock secs (156.99 usr +  0.60 sys = 157.59 CPU)
    5 Sub (Ref): 388 wallclock secs (327.87 usr +  1.30 sys = 329.17 CPU)
    6 Sub (Val): 312 wallclock secs (266.54 usr +  1.02 sys = 267.56 CPU)

Here's my interpretation of things (times are CPU seconds with the Nothing baseline subtracted; the Relative column translates to the number of multiplication runs that could be performed in the time it takes to perform the Function):

    Function   Time (s)   Relative
    Mult          19         1.00
    Array         24         1.25
    Hash         127         6.68
    Sub (R)      299        15.70
    Sub (V)      237        12.47

Question #1: Interpretation? My instinct is to subtract the Nothing time from the rest and treat what's left as the actual cost, thereby ignoring the overhead of Benchmark and the for loops.
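Concretely, that subtraction works out like this (a minimal sketch using the CPU seconds from the run above; the short labels are mine):

    # Sketch: subtract the Nothing baseline, then express each net time
    # in units of the net Mult time.
    my %cpu = (
        Nothing =>  30.09, Mult => 49.02,  Array => 54.69,
        Hash    => 157.59, SubR => 329.17, SubV  => 267.56,
    );
    my $base = $cpu{Nothing};
    my $unit = $cpu{Mult} - $base;       # one net multiplication run
    for my $op (qw(Mult Array Hash SubR SubV)) {
        my $net = $cpu{$op} - $base;
        printf "%-5s  net %6.2fs  relative %5.2f\n", $op, $net, $net / $unit;
    }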

Question #2: Those for loops: I want to make sure caching in the hash and arrays isn't skewing the numbers. Is that a valid concern? Are the for loops a valid solution?
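One way to probe that, as a sketch: visit the same elements in a shuffled order and compare against the sequential timing (the shuffle is a plain Fisher-Yates; nothing here is from the original script):

    # Sketch: build a shuffled visiting order, then time the lookups in
    # that order instead of 1..$b.
    @order = (1 .. $b);                  # package variable, like @array above
    for my $i (reverse 1 .. $#order) {   # Fisher-Yates shuffle
        my $j = int rand($i + 1);
        @order[$i, $j] = @order[$j, $i];
    }
    # e.g. add:  '7 Array (shuffled)' => 'for (@order) { $trash = $array[$_] };',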

Question #3: Pass by reference is slower than pass by value. Is that to be expected in this case? I thought bouncing things off the stack was supposed to be slow.
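A hedged aside on that: in Perl, @_ already aliases the caller's arguments, so doSubR pays to build the \$_ reference on every call and then to dereference it twice inside, on top of the ordinary call overhead. A third variant worth timing, as a sketch:

    # Sketch: by-reference semantics via @_ aliasing, with no reference
    # to build or chase.
    sub doSubA { $_[0] = $_[0] }    # writes straight through the alias
    # e.g. add:  '7 Sub (Alias)' => 'for (1..$b) { doSubA($_); $trash = $_ };',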

Question #4: What have I forgotten that completely invalidates my whole process?

-Lexicon

Replies are listed 'Best First'.
Re: Fundamental Benchmarks
by grinder (Bishop) on Apr 02, 2001 at 13:25 UTC

    A quick couple of points

    • don't use $a and $b as scratch variables, ever, as they are special... just ask sort (see the sketch after this list). A bad habit to get into... sooner or later you'll get bitten.
    • Use $trash = $hash{$_} instead of $trash = $hash{"$_"}: the quotes force an extra stringification (or use $trash = $array["$_"] as well... whatever floats your boat), but be consistent between the array and hash tests.
    • use Inline.
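
    To make the sort point concrete, a small sketch (not from the original reply):

        use strict;
        use warnings;

        # Sketch: a lexical $a shadows the package global that sort's
        # comparator relies on, so the comparison quietly goes wrong.
        my $a = 10;                                # innocent scratch variable
        my @sorted = sort { $a <=> $b } 3, 1, 2;   # warns: '"my $a" used in sort comparison'
        # the comparator now sees the lexical $a (always 10), not the
        # values sort intends to compare.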

    --
    g r i n d e r
Re: Fundamental Benchmarks
by kal (Hermit) on Apr 02, 2001 at 12:58 UTC

    Without wanting to trot out the usual 'benchmark', 'salt', 'pinch of', etc., it's worth noting that you seem to be aiming for implementation optimization rather than algorithmic optimization (although perhaps you have already done the algorithmic work and are just interested in the implementation now).

    To be honest, I'm not sure I'd bother - your 'optimizations' are dependent on how a particular perl implements a feature, which isn't necessarily going to be consistent across versions of perl, and what might be fast now might be slow in the future.

    As an example, take all those assembly (spit) programmers who for years shaved time off multiply instructions by barrel-shifting for powers of two. Along comes the P4, with no barrel shifter, and suddenly their 'optimization' damages performance like you would not believe.

    I know it's not a fantastic answer, but don't bother with implementation optimization - read Knuth, improve your algorithms, and attain bliss :)

      To be sure, I've done quite a bit of work on algorithmic optimization. What was O(n!) time at first is now O(1) or O(n) (depending on use). But certain implementation details left me needing to choose between a subroutine call and a 2D-array lookup. I hadn't realized the array would be so fast, even with 1 million cells. Subroutine calls are far more expensive than I expected. The array implementation will probably increase the speed of the algorithm by 2-4 times, which makes me very satisfied that I went ahead and checked.
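      For flavor, a hypothetical sketch of that trade-off (the names are illustrative, not from Math::Combinator): precompute Pascal's triangle into a 2-D array so each binomial coefficient becomes a bare lookup rather than a sub call.

          # Hypothetical sketch: trade a choose() call for a 2-D array lookup.
          my @C = ([1]);                        # $C[$n][$k] holds C(n, k)
          for my $n (1 .. 100) {
              $C[$n][0] = $C[$n][$n] = 1;
              $C[$n][$_] = $C[$n-1][$_-1] + $C[$n-1][$_] for 1 .. $n - 1;
          }
          # $C[$n][$k] now replaces every choose($n, $k) call in the hot path.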

      For my purposes, I'm fairly confident, as I imagine that a subroutine call will always be a few orders of magnitude slower than an array lookup. All this assuming that my results have any lasting meaning at all, of course. ;)

      If people would dedicate a few minutes to looking at this on different systems, that would be quite sweet. This particular benchmark ran under Red Hat Linux on Perl 5.00503. The box is (I think) a K6-2 250 with 256MB of RAM, but I could be wrong about the processor. I'll post ActiveState 5.6.0 results under NT tomorrow if I get a chance.

      -Lexicon

        This is on a 270 MHz IP-27 (alternatively R12000--I'm not sure which label means more), using Perl 5.004_04.
            A = 10  B = 1000000
            Benchmark: timing 10 iterations of 1 Nothing, 2 Mult, 3 Array, 4 Hash, 5 Sub (Ref), 6 Sub (Val)...
            1 Nothing  :  30 secs ( 28.53 usr  0.03 sys =  28.56 cpu)
            2 Mult     :  63 secs ( 49.93 usr  0.07 sys =  50.00 cpu)
            3 Array    :  42 secs ( 37.05 usr  0.04 sys =  37.09 cpu)
            4 Hash     :  99 secs ( 95.28 usr  0.09 sys =  95.37 cpu)
            5 Sub (Ref): 127 secs (117.77 usr  0.09 sys = 117.86 cpu)
            6 Sub (Val):  98 secs ( 93.28 usr  0.11 sys =  93.39 cpu)
Re: Fundamental Benchmarks
by jbert (Priest) on Apr 02, 2001 at 16:28 UTC
    Some quick thoughts:

    Firstly, it's always good to see if your results stay the same when you permute the test conditions a bit. Make sure you try different numbers for A and B - perhaps graph how these methods scale over different ranges.
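    A minimal sketch of such a sweep (coderefs rather than strings, purely so the snippet is self-contained):

        use Benchmark qw(timethese);

        # Re-run the array and hash lookups at several sizes, instead of
        # trusting a single (A, B) pair.
        for my $n (1_000, 10_000, 100_000, 1_000_000) {
            my (@array, %hash);
            $array[$_] = $hash{$_} = $_ for 1 .. $n;
            print "--- n = $n ---\n";
            timethese(10, {
                Array => sub { my $t; $t = $array[$_] for 1 .. $n },
                Hash  => sub { my $t; $t = $hash{$_}  for 1 .. $n },
            });
        }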

    More interestingly, does changing the order you run your benchmarks change anything? (e.g. does the first run of a memory-hungry implementation cause the process to grab lots of VM from the OS which is then re-used more quickly by the malloc in later runs? Conversely does heap fragmentation slow down memory management for later runs?)

    Another thing to keep in mind is that the perl 'interpreter' actually runs an optimisation pass. When you put together simple sample code, things may be trivially optimised away that would not be in your production code.
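    A quick way to see that pass at work, as a sketch: B::Deparse prints the code as perl compiled it, and constant expressions are already folded before the benchmark ever runs:

        $ perl -MO=Deparse -e '$trash = 2 * 2'
        $trash = 4;
        -e syntax OK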

    Lastly, I'm not sure all your examples are comparing like with like. I'd be very wary (vary wery?) of using microbenchmarks to decide which approach is best for a production app. (For example, what you are attempting is a classic memory/speed trade-off, and your test and production environments may vary widely in available memory under load.)

    Sorry if this is teaching grandma how to suck eggs, but you are better off timing/benching a real app rather than sample code.

    PS. Array lookup is a cheap operation in most languages. It is a straight memory reference. The cost is in consuming memory and how that affects end system performance.

    Finally, if you are really caching multiplications, then have you considered bit twiddling? :-)
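    In that spirit, a tiny sketch (whether a shift actually beats a multiply in Perl, where opcode dispatch dominates, is exactly the kind of thing the benchmark above should settle):

        # Sketch: multiplying by a power of two with a left shift (integers only).
        my $n = 7;
        print $n << 3;    # prints 56, same as $n * 8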