
Re^3: Testing methodology (TAP++)

by tye (Sage)
on Mar 06, 2012 at 05:26 UTC ( [id://958021] )


in reply to Re^2: Testing methodology
in thread Testing methodology

Yeah, I eventually realized that there was almost nothing in Test::More that I was actually using or even found to be a wise thing to use. isa_ok()? I just can't imagine that finding a real mistake. I can certainly see it complaining about an implementation detail that I might change.

I also don't want to use is_deeply(). I think a test suite should complain about what it cares about. If there is extra stuff that it doesn't care about, then it shouldn't complain.

And I find Test::Simple to just be a stupid idea.

But I do use and value Test (though I use my own wrapper around it for a few minor reasons and to provide decent Lives() and Dies() implementations -- better than the several modules that purport to provide such that I've seen). I certainly make frequent use of skip()ing.
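
For illustration only, here is a minimal sketch of the shape such Lives()/Dies() helpers can take when layered over Test's ok(); this is not tye's actual wrapper, and the diagnostics and example calls are made up:

    use strict;
    use warnings;
    use Test;                       # the classic Test module, not Test::More
    BEGIN { plan tests => 2 }

    # Report 'ok' if the code ref throws an exception.
    sub Dies {
        my( $code, $why ) = @_;
        my $lived = eval { $code->(); 1 };
        # Test::ok(): first arg is "got", second is "expected",
        # third is a diagnostic printed only on failure.
        return ok( $lived ? 0 : 1, 1, "expected code to die: $why" );
    }

    # Report 'ok' if the code ref runs to completion without dying.
    sub Lives {
        my( $code, $why ) = @_;
        my $lived = eval { $code->(); 1 };
        return ok( $lived ? 1 : 0, 1, "code died unexpectedly ($@): $why" );
    }

    Dies(  sub { die "boom\n" }, 'die() is caught' );
    Lives( sub { 1 + 1 },        'plain code runs' );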

The fact that Test doesn't use Test::Builder is a minor side benefit that becomes a major benefit every so often when I feel the need to look at the source code. Test::More's skip() is so bizarrely defined that I can't use it correctly without reading the code that implements it, and trying to find said code is terribly aggravating since Test::Builder is involved; so I'm happy to have realized that I never have to go through that again.

There are tons of tools built on top of TAP (and other testing schemes such as used by some of our Ruby-based tests). It is actually useful in the larger context for each individual test to get numbered so we can often correlate different failure scenarios and to make concise reports easy.
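
For anyone who has not looked at raw TAP: it is just a plan line plus one numbered "ok"/"not ok" line per test, optionally with a description or a skip directive. An illustrative fragment (test names invented):

    1..3
    ok 1 - connects to the test database
    not ok 2 - rejects a malformed hostname
    ok 3 # skip not relevant on Win32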

And we have more than one test file per code file in many cases. This is especially useful when there are interesting set-up steps required for some tests. Testing leaf modules is the easiest case and usually doesn't really stress one's testing chops.

Many of my test files abstract a few patterns of test and then run lots of simple tests that are specified with a small amount of data. So, for example, I might have a few dozen lines where each line specifies an expected return value, a method name, and an argument list (and maybe a test description).
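
A hedged sketch of that table-driven layout follows; the Toy::Stack package is a made-up stand-in for whatever module would actually be under test:

    use strict;
    use warnings;
    use Test;
    BEGIN { plan tests => 3 }

    # A stand-in class so the sketch runs; the real module under test
    # would be use()d from disk instead.
    {
        package Toy::Stack;
        sub new    { bless { items => [] }, shift }
        sub add    { my $self = shift; push @{ $self->{items} }, @_; scalar @{ $self->{items} } }
        sub remove { pop @{ $_[0]{items} } }
        sub count  { scalar @{ $_[0]{items} } }
    }

    my $obj = Toy::Stack->new();

    # Each row: expected return value, method name, argument list, description.
    my @cases = (
        [ 2,   'add',    [ 'a', 'b' ], 'adding two items reports a size of 2' ],
        [ 'b', 'remove', [],           'remove returns the most recent item'  ],
        [ 1,   'count',  [],           'one item left afterwards'             ],
    );

    for my $case ( @cases ) {
        my( $want, $method, $args, $desc ) = @$case;
        # Test::ok() shows the third argument as a diagnostic when the test fails.
        ok( $obj->$method( @$args ), $want, $desc );
    }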

Also, having the test code in the same file as the code being tested would complicate coverage measurement, easily distinguishing commits that are fixing code from commits that are fixing tests, searching for real uses of a specific feature while ignoring tests that make use of it, ...

But, no, I'm not interested in "stepping up" to your challenge. Many of my reasons would just come across as a personal attack so I'll not go into them. But most of what I'm talking about I can't demonstrate well by pasting a bit of code. I have no interest in trying such a feat.

- tye        

Re^4: Testing methodology (UPDATED!)
by BrowserUk (Patriarch) on Mar 06, 2012 at 13:24 UTC
    There are tons of tools built on top of TAP.

    I just don't get what people get from TAP.

    As a (module/application) user, I don't give a monkey's what passed or failed. Either it passed or it didn't. Nor do I (as a Win32 user) give a flying fig for whether you skipped a thousand tests because I'm not on *nix.

    As a (module/application) programmer, if 90% passed is acceptable, then 10% of the tests are useless.

    If I wrapped ok() around my 'has this value been dequeued before' test, I'd be producing 100,000 (or 1,000,000, or 100,000,000) OKs.

    Even if the user has configured a tool to suppress or summarise that useless information, it still means 100,000 (...) calls to a function to produce useless output, 100,000 (...) IOs to the screen or pipe, and 100,000 (...) checks in the harness to throw away what I don't want in the first place. My testing therefore takes 10 times as long for no benefit.

    Why do you care about the performance of tests? I can hear some somebodies asking -- especially as I dissed their time/CPU usage statistics. But the problem is that IO goes through the kernel and is (often) serialised. And that completely screws with the statistical legitimacy of my testing strategy.

    I have at least half a dozen different implementations of a bounded Q. Some pure Perl, like this one. Some (in XS) that bypass Perl's Win32 emulation of *nix cond_* calls and use (Win32) kernel locking and synchronisation constructs directly. Some (in C/assembler) that bypass even those and implement locking using CPU primitives.

    Many of them are, or have been at some point, incorrectly coded and will deadlock or livelock. But in almost every case when that happens, if I introduce a few printf()s into the key routines, they perform perfectly. Until I remove them again or (for example) redirect that trace output to NULL. And then they lock again.

    The reason is that the multi-threaded C runtime performs its own internal locking to prevent it from corrupting its own internal structures. And those locks can and do prevent the timing conditions that cause the hangs.

    So, for me at least, not only do I not see any benefit in what TAP does, but the output it requires can completely corrupt my testing.

    It is actually useful in the larger context for each individual test to get numbered so we can often correlate different failure scenarios and to make concise reports easy.

    As the developer receiving an error report, the first thing I'm going to want to do is convert the 'test number' into a file and line number. Why bother producing test numbers in the first place? Just give the user the file and line and have him give that back to me.

    The only plausible benefit would be if the test number were somehow unique. That is, if the number of the test didn't change when new tests were added or old ones were removed. Then I might be able to respond to reports from old versions. But that isn't the case.

    And we have more than one test file per code file in many cases. This is especially useful when there are interesting set-up steps required for some tests.

    Hm. Unit tests, test the unit. System, integration and regression tests are different and live in a different place.

    I'm having a hard time envisaging the requirement for "interesting set-ups" for unit testing.

    Many of my test files abstract a few patterns of test and then run lots of simple tests that are specified with a small amount of data.

    Isn't that exactly what my 'has this value been dequeued before' test is doing? (I re-read the para many times and I'm still unsure what you mean?)

    my $bits :shared = chr( 0 );
    $bits x= $N / 8 + 1;                        # one bit per expected value
    my $t = async {
        while( defined( $_ = $Qn_1->dq ) ) {
            die "value duplicated" if vec( $bits, $_, 1 );
            vec( $bits, $_, 1 ) = 1;            # mark this value as seen
        }
    };

    I see no benefit at all in counting those as individual tests. Much less in allowing the test suite to continue so that the one failure gets lost in the flood of 99,999:

    D'ok 1 - got 1 from queue
    D'ok 2 - got 2 from queue
    D'ok 3 - got 3 from queue
    D'ok 4 - got 4 from queue
    D'ok 5 - got 5 from queue
    D'ok 6 - got 6 from queue
    D'ok 7 - got 7 from queue
    D'ok 8 - got 8 from queue
    D'ok 9 - got 9 from queue
    ...
    D'ok 99996 - got 99996 from queue
    D'ok 99997 - got 99997 from queue
    D'ok 99998 - got 99998 from queue
    D'ok 99999 - got 99999 from queue
    D'ok 100000 - got 100000 from queue

    (D'oh! Preachin' agin. Sorry! :)
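
    For concreteness, a sketch of the aggregated alternative: count duplicates inside the loop and emit a single result line at the end. (The core Thread::Queue module is only a stand-in here for the custom queue in the posted code, and $N is arbitrary.)

        use strict;
        use warnings;
        use threads;
        use threads::shared;
        use Thread::Queue;

        my $N = 100_000;
        my $Q = Thread::Queue->new;

        my $bits :shared = chr( 0 );
        $bits x= int( $N / 8 ) + 1;
        my $dups :shared = 0;

        my $t = async {
            while( defined( my $v = $Q->dequeue ) ) {
                ++$dups if vec( $bits, $v, 1 );     # just count; don't report each one
                vec( $bits, $v, 1 ) = 1;
            }
        };

        $Q->enqueue( 0 .. $N - 1 );
        $Q->end;                                    # no more items; dequeue returns undef once drained
        $t->join;

        # One line of output for the whole run, instead of $N individual 'ok's:
        print "1..1\n";
        print $dups ? "not ok 1 - $dups values were dequeued more than once\n"
                    : "ok 1 - no value was dequeued twice\n";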

    Also, having the test code in the same file as the code being tested would complicate coverage measurement,

    Maybe legit on a large collaborative project. But I still maintain that if I need a tool to verify my coverage, the module is too damn big.

    Update: split this quote out from the previous one; and responded separately

    easily distinguishing commits that are fixing code from commits that are fixing tests, searching for real uses of a specific feature while ignoring tests that make use of it, ...

    And I do not see the distinction here either. Test code is code. You have to write it, test it and maintain it. The bug fix that fixed the incorrectly coded test that was reporting spurious errors is just as legitimate and important as the one that fixed the code under test that was reporting legitimate errors. Treating them in some way (actually, any way) differently is nonsense.

    And this, (dare I say it?), is my biggest problem with TDD: "The Franchise". It actively encourages and rewards the writing of reams and reams of non-production code. And, in most cases, it does not factor that code into the costs and value of the production product.

    Try explaining to your National Project Coordinator (due in Parliament the following week to explain to the Prime Minister why the project is late and over budget) that the reason everything worked during in-house testing, yet went belly-up on the first day of the high-profile, closely monitored, €18 million pilot study, was that all the in-house tests had been run with debug logging enabled. That logging so completely distorted the timing that nobody believed you when you said that, in critical areas, the overzealous use of over-engineered OO techniques meant there was no way it could keep up with full production-scale loading. The logging was effectively serialising inbound state changes, so nothing broke.

    But, no, I'm not interested in "stepping up" to your challenge.

    From what you've said about (at least some) of the test tools I'm critiquing, you would not have been the right 'big gun' for my purpose anyway.

    Many of my reasons would just come across as a personal attack so I'll not go into them.

    That is a shame. (For me!)

    I don't feel that I respond 'hurt' to critiques of my code. I may argue with conclusions and interpretations; but (I like to think) only because I disagree with your technical assessment of that code.

    But when you start pseudo-psychoanalysing me on the basis of my code -- or words -- and start attributing their deficiencies (as you see them) to some personality trait indicative of some inherited mental condition, rather than to typos, misunderstandings or, dog forbid, mistakes, I will take umbrage and respond in kind.

    This is where we have always clashed.

    But most of what I'm talking about I can't demonstrate well by pasting a bit of code. I have no interest in trying such a feat.

    And, as is so often the case, the most interesting part of your response leaves me with a million questions and wanting more...


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      If I wrapped ok() around my 'has this value been dequeued before' test, I'd be producing 100,000 (or 1,000,000, or 100,000,000) OKs.

      What a stupid idea. And not one that I saw anybody suggest.

      Hm. Unit tests, test the unit. System, integration and regression tests are different and live in a different place.

      I'm having a hard time envisaging the requirement for "interesting set-ups" for unit testing.

      Well, I guess you haven't done much interesting unit testing? As I said, testing leaf modules is relatively trivial. Unit testing non-leaf modules can get tricky, and there can be interesting set-up required to mock out the things that the non-leaf module depends on, so that you test the unit and not the whole system it employs.

      But even something relatively trivial, a leaf module like File::FindLib, required several interesting set-ups that would be a huge pain to do from a single file. So I'm not particularly moved by your failure of imagination on that point.
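
      For illustration, one common way to arrange that kind of set-up in Perl is to replace the dependency's subs in its own package before exercising the unit. Everything below (package names and subs) is invented for the sketch:

          use strict;
          use warnings;
          use Test;
          BEGIN { plan tests => 1 }

          # Pretend dependency; in real life it would be use()d from disk
          # and might talk to a network or database.
          {
              package My::Backend;
              sub fetch_config { die "talks to the network -- not in a unit test!" }
          }

          # The (non-leaf) unit under test, also inlined so the sketch runs.
          {
              package My::Service;
              sub new      { bless {}, shift }
              sub greeting {
                  my $cfg = My::Backend::fetch_config();
                  return "hello, $cfg->{user}";
              }
          }

          # Mock out the thing the unit depends on, so the test exercises
          # the unit rather than the whole system it employs.
          {
              no warnings 'redefine';
              *My::Backend::fetch_config = sub { return { user => 'alice' } };
          }

          my $svc = My::Service->new;
          ok( $svc->greeting(), 'hello, alice', 'greeting uses the configured user' );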

      Many of my test files abstract a few patterns of test and then run lots of simple tests that are specified with a small amount of data.
      Isn't that exactly what my 'has this value been dequeued before' test is doing? (I re-read the para many times and I'm still unsure what you mean?)

      Perhaps you should have moved on to the next sentence? "So, for example, I might have a few dozen lines where each line specifies an expected return value, a method name, and an argument list (and maybe a test description)." That is so very much not "call the same method 10,000 times expecting the same result each time, reporting 'ok' separately for each call". I don't see how one can confuse the two so I won't waste time trying to restate that.

      But then, I don't really consider what you keep talking about as a unit test. It is a functional test that has no reproducibility and relies on external interruptions in hopes of randomly inducing a problem. Yes, if I had to write a thread queue module, it is a test I would run but it would not be in the main "unit tests".

      The unit tests would cover the parts of the unit that can be tested in a controlled manner. Does trying to dequeue from an empty queue block? Does trying to enqueue to a full queue block? If I enqueue two items, do they dequeue in the expected order? They'd catch those common off-by-one errors, for example.
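
      A sketch of that kind of controlled, reproducible check, with the core Thread::Queue standing in for a home-grown queue (the blocking behaviour of a bounded queue would need timeout plumbing that is omitted here):

          use strict;
          use warnings;
          use Thread::Queue;
          use Test;
          BEGIN { plan tests => 4 }

          my $q = Thread::Queue->new;

          # An empty queue reports nothing pending, and a non-blocking
          # dequeue hands back undef rather than hanging.
          ok( $q->pending, 0, 'new queue is empty' );
          ok( ! defined $q->dequeue_nb, 1, 'dequeue_nb on an empty queue gives undef' );

          # Items come back out in the order they went in (FIFO).
          $q->enqueue( 'first', 'second' );
          ok( $q->dequeue, 'first',  'first item enqueued is dequeued first' );
          ok( $q->dequeue, 'second', 'second item follows' );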

      Part of the point of the unit tests is that they get run nightly and whenever a commit is pushed and before a revision gets passed to QA. A test that requires a CD be played so we get some real hardware interrupts isn't usually part of that mix.

      And, yes, I have written thread queues and written tests for them. And, in trying to test for the interesting failures particular to such code, I wrote tests similar to what you wrote, tests that more resemble a load test than a unit test. But, in my experience, the load-like tests were pretty useless at finding bugs, even when running lots of other processes to try to add random interruptions to the mix. Running multiple types of load mixes with no failures would not mean that we wouldn't run into bugs in Production. And a bug being introduced was usually more usefully pointed out by the reproducible tests than by the "when I run the load test it fails".

      Unit testing sucks at helping with the interesting failures of things like thread queues. But that also goes back to why I don't use threads much any more. I prefer to use other means that have the benefit of being easier to test reliably.

      But I still maintain that if I need a tool to verify my coverage, the module is too damn big.

      I don't technically need a coverage tool to tell me which parts of one module aren't covered. But it is very convenient. And, yes, it saves a ton of work when dealing with hundreds of modules that a dozen developers are changing every day. Coverage just provides useful reminders about specific lines of code or specific subroutines that got completely missed by the test suite (sometimes developers get rushed, as hard as that is to imagine) and nothing more.

      I rarely have enough time on my hands that I consider reading through hundreds of modules and hundreds of unit tests trying to notice which parts of the former got missed by the latter. And when I'm working on a tiny leaf module, I still figure out "which part did I not test at all yet" by running a command that takes maybe a few seconds to tell me rather than taking a minute or few to swap in every tiny feature of the module and every tiny step tested and perform a set difference in my head.
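
      (The tool in question here is typically Devel::Cover. A hedged sketch of a typical invocation over a t/ directory -- the paths and report format are assumptions:)

          # wipe any old coverage database, run the suite under Devel::Cover,
          # then report which lines and subs the tests never touched
          cover -delete
          HARNESS_PERL_SWITCHES=-MDevel::Cover prove -lr t/
          cover -report html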

      Particular bad ideas with coverage include: thinking that 100% (or 99%) coverage really means that you've got good test coverage; shooting for 100% (or 99%) coverage as a goal in itself rather than as a tool for pointing out specific missed spots that either shouldn't be tested or should be carefully considered for how they should be tested; adding a stupid test because it causes a particular line to get 'covered'.

      Update's response:

      And I do not see the distinction here either. Test code is code. You have to write it, test it and maintain it.

      No, I don't write tests for my test code. I run my test code. That has the side effect of testing that the test code actually runs. If you call that testing the test code, then you must have some confusing conversations.

      And I don't have to maintain code if it isn't being used any longer. But all of my code is used by at least one test file. So, if I don't distinguish, then I can't tell that a feature is no longer used (other than by being tested) and so can just be dropped.

      And no change to test code is going to cause a failure in Production nor surprise a customer. So who cares about which changes, and in what ways, is very different between changes to real code and changes to test code.

      - tye        

        Well, I guess you haven't done much interesting unit testing?

        Of course. That explains it. (Yes, I can be just as sarcastic and dismissive as you. You know that. Why go there? Knew it was too good to last.)

        But then, I don't really consider what you keep talking about as a unit test. It is a functional test ...

        I tried to find definitions of 'unit testing' & 'functional verification testing' that I thought we might both agree on. As is, I couldn't find any from a single source that I could agree with. And cherry picking two from different sources to make a point would be pointless.

        So, I'll state my contention in my terms and let you disagree with it in yours.

        Your style of unit testing -- in my terms: laborious, verbose and disjointed -- will not discover anything that my style of unit testing -- in your terms, perhaps: functional verification -- will fail to highlight.

        But my style of UT will discover every failure that your style might. And much, much more. Therefore, your style of UT is incomplete without some of my style of UT.

        Therefore, your style of UT is redundant. A cost for no benefit. Make-work.

        You will argue (have argued) that your unit tests help you track down trivial programming errors -- your cited example being off-by-one errors. My contention is that, with the right configuration, my style of UT allows me to track them down just as effectively. E.g.:

        C:\test>perl async\Q.pm -N=10 -T=2 -SIZE=10
        1 2 3 4 5 6 7 8 9 10
        10 items by 2 threads via three Qs size 10 in 0.030158 seconds

        I added a single print to the dq loop. (Actually, put back: it was there to start with and was removed once proven.)

        And I configured the test for 2 threads. Which means that each of the two "pools" gets one each. Thus, the ordering from Q1_n via Qn_n and Qn_1 is deterministic.

        So, I started with the simple case, and only increased the workload once the basic functionality was working. I removed the print to kill the (now redundant) noise.

        One set of tests to write (and maintain!) that serves both purposes. Cheaper and more cost effective.

        And here is the kicker.

        The code I posted contains a (quite serious) bug -- put back especially for the purpose.

        And the challenge -- which you won't take -- is that no amount of your style of UT will ever detect it!

        My style of UT makes it trivial to find. (And no, it is not a subtle timing issue or freak lock up or anything else that you can blame on "threading").

        Just a plain ol' coding bug.

        Betcha can't? (Know you won't! :)


