the sorry state of Perl unit testing framework

Updated: Test::Tiny was benchmarked and analyzed, Test::More alpha release 1.301001_101 with new backend tried

Soon I will be going to QAH 2015 in Berlin. My number one goal is to get parallel testing for TAP::Harness working on Win32 Perl but this post isn't about parallel testing. TAP::Harness stymied me many times trying to touch the codebase. It is militantly OOP, and obfuscated with its own implementation of method dispatch, and is written in a "declarative language" similar to a makefile. It has layers of faux-abstraction, that superficially purports to support pluggability, yet doesn't allow anything but the current implementation to fit the abstraction layers. I belive it is a pedagogical exercise in a ivory tower, which in plain terms, it is a homework assignment for appealing to a professor's ego that shows off all the skills itemized on your syllabus.

Researching the bloated design of TAP::Harness also brings me to investigate the other side of the TAP connection, Test::Simple/Test::More. It is equally inefficient I discovered.

All these tests were done a 2.6 ghz 2 core machine, running 32 bit WinXP. The primary tool I use for benchmarking in this post is called timeit.exe. It is a simples times() like benchmark tool that asks the NT kernel for its counters after the process exits. The resolution of these counters is 15.625 ms but the workloads I use to benchmark take seconds or minutes to complete, so 15.625 ms resolution isn't an issue.

The workload I use is always 1 million tests, the tests are run by GenTAP.pm, which is from http://github.com/bulk88/Win32-APipe. GenTAP.pm and fastprint.t, fasttinyok.t and fastok.t should be portable and run on any Perl platform if you want to try reproducing these benchmarks yourself.

I refrained from using nytprof in this write up since nytprof has overhead, and questioning individual subs, and lines of code, is pointless, if slowness is a conscious systemic design rule, not a couple bad drive-by patches over the years. The output is redirected to a file, since this way there is no overhead of the Win32 console in writing to STDOUT.

Test::More

fastok.t calls Test::More's ok() from version 1.001014, 1 million times in a loop with each test always passing, and a randomly randomized test name.

timeit perl t\t\fastok.t > t.txt

Version Number:   Windows NT 5.1 (Build 2600)
Exit Time:        2:11 am, Saturday, April 4 2015
Elapsed Time:     0:02:00.671
Process Time:     0:01:59.234
System Calls:     3664127
Context Switches: 320686
Page Faults:      948528
Bytes Read:       3339101868
Bytes Written:    100048020
Bytes Other:      73765993
[download]

which leads to 0.000119 seconds to do 1 ok() call, or 0.1 milliseconds. This is signifigant, for each 10000 tests, 1 second of overhead. It also means, you can't run more than 10000 tests per second on per core no matter what you do. How often do you run "make test" and wait for it to finish and it feels like filling a gas tank, how slow is travis or whatever CI solution you use?

If your unit testing consists of code generated permutations of parameters to your module, 10,000 tests or 100,000 is very easily reachable on 1 software project/module. Some very popular CPAN modules do this style of code generated permutations of tests.

Now what could be the fastest possible TAP generation? "type file_of_tap.txt" or "cat file_of_tap.txt" is cheating. Under the same conditions, which is the best case scenario for the overhead of test to compare Test::More against, is a simple "print "ok ".(++$counter)." - $testname\n";" in a loop instead of "ok(1, $testname)", which is what fastprint.t does.

timeit perl t\t\fastprint.t > t.txt

Version Number:   Windows NT 5.1 (Build 2600)
Exit Time:        2:19 am, Saturday, April 4 2015
Elapsed Time:     0:00:02.156
Process Time:     0:00:02.109
System Calls:     21413
Context Switches: 6818
Page Faults:      14663
Bytes Read:       270297
Bytes Written:    60606608
Bytes Other:      6971963
[download]

0.0000021 seconds per the DIY ok().

0.000119/0.0000021 57x more time. FIFTY SEVEN times more CPU. To summarize, if you use Test::More, you might as well imagine there is an gig ethernet cable and a UDP socket between your TAP emitter and TAP consumer.

Now about memory use of Test::More. I modified fastok.t as such

unshift(@INC, '.');
require 't/t/GenTAP.pm';
require Test::More; #load but dont call anything in Test::More
#we want runtime mem overlead, not loadtime, otherwise the require wil
+l happen
#inside GenTAP if it isn't done here
system('pause');#sample memory
GenTAP(0, 0, 'ok', 1000000);
system('pause');#sample memory
[download]

before 3,828KB, after 397,828 KB, peak 397,840 KB.

(397840-3828)/1000000=0.394012 KB per test emitted. 400 bytes per test. What on earth is in those 400 bytes? My test name passed to T::M::ok() is always 42 bytes long. Lets round that to the next 16 byte win32 malloc boundary, 48+12(perl's win32 ithread malloc wrapper in vmem.h)+16(sv head)+8(svpv body)+4(SV * somewhere else)=88 bytes for storing the test name. Where did the other 300 bytes go? Why is Test::More saving the names of passing tests? showing off your skills in LOC per hour for your CV? writing job-for-life unmaintable code? The TAP parser is responsible for maintain TAP state, not the TAP emitter. A TAP emitter should have no memory increase between sucessive calls to ok().

Test::More has no competitors on CPAN except for Test::Tiny which makes no attempt at API compatibility but has a similar ok() sub. So using fasttinyok.t, which calls Test::Tiny's ok() sub 1 million times, I get

timeit perl t\t\fasttinyok.t > t.txt

Version Number:   Windows NT 5.1 (Build 2600)
Exit Time:        7:42 pm, Tuesday, April 7 2015
Elapsed Time:     0:00:05.218
Process Time:     0:00:05.140
System Calls:     49612
Context Switches: 24005
Page Faults:      17396
Bytes Read:       146216
Bytes Written:    57859498
Bytes Other:      17639334
[download]

Test::Tiny's ok() is (5.140/2.109=2.437) 2.4x slower than my ideal DIY ok() implementation. which compared to Test::More's 57x slower, 2.4x is a rounding error. Test::Tiny is a real working CPAN module remember.

About Test::Tiny's memory usage, using the same breakpoint positions, before 3,008KB, after 3,028KB, 3028-3008=20 KB. 20000/1000000=0.02 bytes per test, which means unmeasurable small. Basically no increase in memory usage, unlike the 100s of MBs seen with Test::More.

Even a drop in replacement for Test::More that is 10x slower than the ideal DIY ok() implementation above, or in other words 4x slower than Test::Tiny, is still 5x times faster than Test::More. Just about anything is faster than a sloth pulling a wagon. Something needs to be done about Test::More, the entire perl community relies on it, and it is unworkably slow. Either a drop in replacement, or replacing all of the internals of Test::More with a simplied architecture.

I was told to try a alpha release (1.301001101) of Test::More which included a new backend which hoped to improve its performance. I will therefore benchmark it.

timeit perl t\t\fastok.t > t.txt

Version Number:   Windows NT 5.1 (Build 2600)
Exit Time:        11:37 pm, Tuesday, April 14 2015
Elapsed Time:     0:02:10.859
Process Time:     0:02:09.375
System Calls:     2399091
Context Switches: 238722
Page Faults:      543275
Bytes Read:       3031284
Bytes Written:    103643867
Bytes Other:      87114348
[download]

The results are bad. 10 seconds more old stable 1.001014 Test::More, or 9% slower.

TAP::Harness

Onto the TAP consumer, TAP::Harness. For the next example, remember fastprint.t takes 2 seconds of CPU to print its 1 million tests. I dont think fastprint.t's process time is included by timeit.exe tool but with the numbers shown, 2 seconds is a rounding error if it is included.

C:\sources\Win32-APipe>timeit C:\perl521\bin\perl.exe "-MExtUtils::Com
+mand::MM"
"-MTest::Harness" "-e" "undef *Test::Harness::Switches; test_harness(0
+, 'blib\li
b', 'blib\arch')" t\t\fastprint.t
t\t\fastprint.t .. ok
All tests successful.
Files=1, Tests=1000000, 66 wallclock secs (65.86 usr +  0.13 sys = 65.
+98 CPU)
Result: PASS

Version Number:   Windows NT 5.1 (Build 2600)
Exit Time:        3:00 am, Saturday, April 4 2015
Elapsed Time:     0:01:06.406
Process Time:     0:01:06.203
System Calls:     483839
Context Switches: 182749
Page Faults:      78950
Bytes Read:       62566241
Bytes Written:    53961404
Bytes Other:      11595189

C:\sources\Win32-APipe>
[download]

(60+6)/1000000 = 0.000066 seconds for TAP::Harness to process 1 TAP test. It is better than Test::More, with TAP::Harness taking, for parsing 1 test, 55% of the time it takes, for Test::More to emit 1 test.

Now about the memory usage of TAP::Harness. Checking the process memory with Task Manager with a breakpoint right before "test_harness" sub is called, shows 5,868 KB, Windows OS shows the process peaked at 106,368 KB, and a breakpoint right after test_harness sub, shows 96,636 KB. There are 2 memory problems here that need to be broken down.

Problem 1, TAP::Harness uses about 100 bytes of memory for each *passing* test ((106368000-5868000)/1000000). The internal state of tests results isn't a sparse array or linked list of failed tests, or a vec() or C bitfield, heck, it isn't even a @array with undef/uninit slices for successful tests, which would be 4 bytes per test on a 32bit OS. It is 100 bytes per passing test. What is 100 bytes? 100/4 is 25 pointers/slice members. For reference, each scalar you create, on 32 bit OSes, is 4 slice members. And I will guess it uses a blessed hash with 1 or 2 hash keys for each test without even looking at TAP::Harness's implementation.

Problem 2, in the breakpoint, after test_harness() was executed, memory dropped from 106,368 KB to 96,636 KB. Only about 10MB. What is inside the 96,636-5,868=90,768KB?

Here is the console log with the breakpoints (the "Press any key to continue . . ." lines) to show where memory was sampled.

C:\sources\Win32-APipe>C:\perl521\bin\perl.exe "-MExtUtils::Command::M
+M" "-MTest
::Harness" "-e" "undef *Test::Harness::Switches; system 'pause'; test_
+harness(0,
 'blib\lib', 'blib\arch'); system 'pause';" t\t\fastprint.t
Press any key to continue . . .
t\t\fastprint.t .. ok
All tests successful.
Files=1, Tests=1000000, 65 wallclock secs (65.05 usr +  0.11 sys = 65.
+16 CPU)
Result: PASS
Press any key to continue . . .

C:\sources\Win32-APipe>
[download]

Notice

All tests successful.
Files=1, Tests=1000000, 65 wallclock secs (65.05 usr +  0.11 sys = 65.
+16 CPU)
Result: PASS
[download]

was already printed, so what do those 90 MB contain? Why is TAP::Harness holding onto state after printing the final line? surely this can't all be malloc fragmentation preventing a release of memory? Or was TAP::Harness written to leak memory since "the process will exit soon, dont waste CPU freeing anything", perhaps a legitimate reason? I doubt that was the intention of the person who designed TAP::Harness's OOP API.

In combination, TAP::Harness+Test::More take 0.185 ms per ok() call, of overhead. is() which is more common than ok(), will probably take even more than 0.185ms, so 0.185 ms per test, is the current best case scenario using the existing unit testing framework.

Solutions

Test::More

Rewrite Test::Simple and Test::More back into standalone modules like they were 20 years ago, and remove their usage of Test::Builder would be best, but that requires Test::More's authors and maintainers to agree that the code is deeply flawed and agree to replacing it.

Summarizing some code in Test::Builder, do we really need to ever implement this, anywhere?

sub in_todo {
    my($todo, $caller);
    $todo = defined((caller++$caller)[0])?(caller$caller)[0]->can('is_
+todo')?(caller$caller)[0]->is_todo?1:undef:undef:0 until defined $tod
+o;
    $todo;
}
[download]

Since I expect signifigant protests from the peoples whoses CVs depend on protecting their precious snowflakes (see this incredulous post https://github.com/Test-More/test-more/issues/252 from 2012, leaking memory is by design and will never change by the then author), a drop in replacement for Test::More under a different namespace and patching dists away from T::M/T::B, and removing T::M/T::B from Perl 5 core is probably the easiest way forward.

TAP::Harness

Rewriting TAP::Harness from scratch is probably the only solution, since a couple 3rd party modules are crazy enough to integrate themselves, like TAP::Harness::Archive. The typical "make test" has no use for TAP::Harness OOP bloat, with the only 2 options being TEST_VERBOSE on or off, and parallel or not.

I have done nytprof-ing of TAP::Harness, but nothing is fixable there without admitting all the design rules are a list of what not to do.

A simple design for a new harness would be, a TAP source class (usually a process class) that returns a stream class. The stream class returns a string name() (filename of .t, or a http:// URL of TAP or a disk path of TAP), and returns multi-KB blocks and eventually undef as EOS indicator, just 2 methods. For passing tests store nothing (undef), or store a "sparse" range of passing tests unblessed objects, store failed tests and unknown lines and diag in a linked list for dumping/summing at the end. Rewrite the parser in XS to quickly search for newline and "not ok" and "ok" token, maybe use BM algo. Even for a PP version, index and substr for a 1 pass parser through the block. For passed tests, store nothing. If a TAP stream has all passing tests, and reaches the end of stream, all the passing tests are represented by 1 hash with 2 keys (start range, end range). This is a long shot, but ideally, pipes shouldn't even be used between a TAP consumer and TAP emmitter. Future Test::More can communicate through shared memory through XS to future TAP::Harness, with the IPC buffer maybe looking like a stream, or the TAP "parsing" (no TAP is generated) is "done" (there is no TAP, just an array of test records) on the client side in Test::More, which also gives the benefit of "out of sequence" TAP being impossible due to perl thread safety in Test::More::XS.

Why is TAP::Harness's design flawed? I saw all of this being done with stepping and nytprof inside TAP::Harness.

Things not do include, no method accessors, callbacks, method dispatchers in PP, declarative syntax, no pluggable tokenizers, no roll your own RTTI, not RTTI at all, just 3 classes, and absolutly no "Harness::Data::Integer" class since Perl isnt Javascript, and will not JIT your Integer class into a scalar, always use hash keys, do not use classes and constructors if you can use an integer bitfields or integer bools/!!0, this is Perl, not C++, not Java. Do not build bool methods that should be bitfields, that are aggregations of other bool methods, since you wind up exponential method calls, and there is no caching since caching is evil since it cant be plugged later on according the OOP dogma, so the result is to parse a TAP line over and over. Do not use "class factory" classes. There is no reason to have closured, dynamically generated, anonymous classes. If you need 2 classes, both named "Record" since you are too incompetent to prefix "Customer::" or "Inventory::" to the word "Record", you should name classes after your pets, I hope your house doesn't have 2 Rustys. Perl has "packages", do not invent our own. Do not nest hashes inside hashes. Perl hashes aren't C structs where all the "."s in "pointer->a.b.c.d" optimized away. Do not write classes, where the ctor, does nothing except blesses an empty hash, and then each method of the class checks if the object was "really ctored" and conditionally calls the real ctor, in some crazy attempt to optimize for bad callers that unnecessarily create objects, never call a single method on them, then dtor them. Do not ask an object if it can() do anything, that means your objects are lying as to their abstract base class. If you bought a car on ebay, and the seller mails you 4 tires with a $1K shipping charge, do you timidly do nothing and buy another car online? Do not implement has_pending() with return !!scalar(shift->pending()); where pending() is return map{ref($_)->new($_)} @{shift->{queue}}. Do not implement meaningless sort calls, such as in $new->{$_} = $self->{$_} foreach sort @clone;. Do not collect error/fail state diagnostics, build error message strings, when there is no error, and you will never use that error message or state snapshot again.

For archival reasons, since Win32-APipe is a temporary name, and will be deleted soon. GenTAP.pm.

use strict;
use warnings;
use constant UVMAX => 2**32; #%Config doesn't have this, this will do
use Time::HiRes qw( sleep );

#void GenTap($time_per_sleep, $sleep_count, $emitter, $test_count)
# time_per_sleep -
#  seconds to sleep when a sleep is triggered, floating point okay and
+ will sleep
#  fractions of a second (uses Time::HiRes)
# sleep_count -
#  integer count of number of times to sleep, the sleeps will be evenl
+y
#  distributed between tests
# emitter -
#   ok - use ok()
#   print - use perl core print()
#   block - use perl core print with atleast 4KB of multiline output p
+er print
# test_count -
#   number of tests to run, will be interleaved with sleeps if applica
+ble
sub GenTAP {
    die "usage: GenTAP" if @_ != 4;
    my ($time_per_sleep, $sleep_count, $emitter, $test_count) = @_;
    die "invalid test count" if $test_count >= UVMAX;
    my($testnum, $tests_before_sleep, $tests_left_before_sleep, $buffe
+r, $ok) = (1, UVMAX, UVMAX);

#emmit a plan
    if($emitter eq 'ok') {
        require Test::More;
        Test::More::plan(tests => $test_count);
        $ok = \&Test::More::ok;
    } elsif($emitter eq 'tinyok') {
        require Test::Tiny;
        Test::Tiny->import(tests => $test_count);
        $ok = \&Test::Tiny::ok;
        $emitter = 'ok'; # less branchin in main loop
    } else {
        my $plan = "1..$test_count\n";
        if($emitter eq 'print') {
            print $plan;
        } elsif ($emitter eq 'block') {
            $buffer = $plan;
        } else {
            die "unknown emitter";        
        }
    }
    
    $tests_left_before_sleep = $tests_before_sleep = $test_count / $sl
+eep_count if $sleep_count;
    
    while ($testnum <= $test_count) {
# UC 1 random character in each test name for a touch of crazy
# and test output that isn't identical each time, UCing a space produc
+e no change
# which is intentional, since sometimes the line should not be "typo-e
+d"
        my $testname =  "All work and no play makes Jack a dull boy";
        my $replace = int(rand(length("All work and no play makes Jack
+ a dull boy")));
        substr($testname, $replace, 1, uc(substr($testname, $replace, 
+1)));
        if ($emitter eq 'ok') {
            &$ok(1, $testname);
        } else {
            my $testline = 'ok '.$testnum.' '.$testname."\n";
            if ($emitter eq 'print') {
                print $testline;
            #} elsif ($emitter eq 'block') {
            } else { #checked above
                $buffer .= $testline;
                if(length($buffer) > 4096) {
                    print $buffer;
                    $buffer = '';
                }
            }
        }
        sleep($time_per_sleep), ($tests_left_before_sleep = $tests_bef
+ore_sleep) if --$tests_left_before_sleep == 0;
        $testnum++;
    }
    
    #flush the buffer in block mode
    print $buffer if $emitter eq 'block' && length($buffer);
}
1;
[download]

fastprint.t

unshift(@INC, '.');
require 't/t/GenTAP.pm';
GenTAP(0, 0, 'print', 1000000);
[download]

fastok.t

unshift(@INC, '.');
require 't/t/GenTAP.pm';
GenTAP(0, 0, 'ok', 1000000);
[download]

fasttinyok.t

unshift(@INC, '.');
require 't/t/GenTAP.pm';
GenTAP(0, 0, 'tinyok', 1000000);
[download]

Back to Meditations