Re^3: Testing methodology

by tobyink (Canon)
on Mar 05, 2012 at 08:17 UTC ( [id://957853] )


in reply to Re^2: Testing methodology
in thread Testing methodology

Re max_length. Let's say I'm writing a module Amusement::Park, which has a list of Amusement::Rides. Each Amusement::Ride has an associated Q. (This is a popular amusement park.) I'm writing the Amusement::Park but have no control over the Amusement::Rides, so I don't initialise the Q objects. Good amusement park management says I need to point customers at the rides with the least busy queues. To do this I need to calculate length ÷ max_length.
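
A minimal sketch of that selection, assuming hypothetical length and max_length accessors on each ride's Q object, plus a queue accessor on the rides -- none of these are from any particular module:

    use strict;
    use warnings;
    use List::Util qw(reduce);

    # Pick the ride whose queue has the smallest length/max_length ratio.
    sub least_busy_ride {
        my @rides = @_;
        return reduce {
            ( $a->queue->length / $a->queue->max_length )
          < ( $b->queue->length / $b->queue->max_length ) ? $a : $b
        } @rides;
    }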

A use case for is_full: say I generate events of multiple levels of importance -- errors, warnings and debugging messages. If the queue is full, I might choose to silently drop debugging messages rather than be blocked. Or there might be several queues I could potentially add an item to, and I wish to choose one that is not full.
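
A sketch of that policy; is_full and a blocking enqueue are assumed methods on the queue object under discussion (Perl's core Thread::Queue has no is_full):

    sub log_event {
        my ( $queue, $level, $message ) = @_;
        # Silently drop debugging messages rather than block on a full queue.
        return if $level eq 'debug' && $queue->is_full;
        $queue->enqueue( [ $level, $message ] );    # may block while full
    }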

peek/peek_all -- there may be several producers but only one consumer, so there is no danger of peeking at a value only for somebody else to dequeue it.

Replies are listed 'Best First'.
Re^4: Testing methodology
by BrowserUk (Patriarch) on Mar 05, 2012 at 11:17 UTC

    Okay. I'll (briefly) join you in analogy land.

    I'm writing the Amusement::Park but have no control over the Amusement::Rides, so don't initialise the Q objects.

    Then query the information from Amusement::Rides.

    If Amusement::Rides creates the queues, then good OO dictates that you shouldn't be going direct to the Qs anyway. If he chooses to expose that information to you, that's his call.

    Amusement::Park ... max_length

    If you need to try and balance your queues, then the calculation n/maxN is completely useless to you. Different rides take different times to run, and accommodate different numbers of riders. A coaster that loads 40 riders every 2 minutes drains an 80%-full queue far faster than a teacup ride that loads 10 riders every 5 minutes drains a 50%-full one. Percentage fullness is therefore a completely useless measure of flow.

    And if you try to use it to control flow, you are going to end up with the well-known supermarket checkout/motorway traffic holdup scenario: whichever queue you switch to, the others will always suddenly flow faster.

    Getting back to the real world. Even if your guests wouldn't rebel at being made to queue for the lame-ass World Showcase when they came to go to Spaceship Earth, using multiple queues is completely the wrong way -- and a rookie mistake -- to manage distributing resources to multiple consumers.

    The correct, and only workable way is a single queue. The analogy here -- which may or may not make sense wherever you are in the world -- is the bank or post-office (even some enlightened supermarkets now), queuing system. A single queue and "Go to window(checkout) 8"...
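
    A minimal sketch of that model using core Perl threads. The bound via $q->limit needs Thread::Queue 3.x; the pool size and item count are arbitrary:

        use strict;
        use warnings;
        use threads;
        use Thread::Queue;

        my $q = Thread::Queue->new;
        $q->limit = 10;    # bounded: enqueue blocks while the queue is full

        # One pool of identical consumers draining the single queue.
        my @workers = map {
            threads->create( sub {
                while ( defined( my $item = $q->dequeue ) ) {
                    # process $item ("go to window 8") ...
                }
            } );
        } 1 .. 4;

        $q->enqueue($_) for 1 .. 100;    # the producer(s)
        $q->end;                         # no more items; dequeue returns undef
        $_->join for @workers;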

    And finally, I bring you back to the fact that the moment you get your hands on this information, it is out of date. Between the return statement in the method and you performing your division, you can lose your time-slice on the processor. And by the time you get your next one, the queue(s) you are instrumenting may have emptied (and filled and emptied again, if you are not the only producer).

    It is a simple fact of life, when trying to instrument concurrency, that as soon as you have made a measurement, it is out of date. And the worst thing you can do is to try and synchronise threads in order to try and regulate them. It is the very act of trying to turn stochastic systems into deterministic systems that creates all the "evils" of threading -- dead-locking; live-locking; priority inversion et al.

    And the 'cure' for almost all of them is queues. But to work properly, they have to be free-running, bounded and self-regulating. It is that very self-regulation that permits the guarantees -- that producers and consumers will make steady forward progress and, eventually, finish -- that allow the programmer to ignore synchronisation and locking and fairness, because they -- in conjunction with the OS scheduler -- will take care of it for him.

    And the real beauty of the free-running queue is that regardless of whether you have a single consumer and multiple producers; or vice versa; or one of each; and regardless of whether the jobs are of equal size (cpu or elapsed time) or widely disparate sizes; they will be processed in a fair and timely manner.

    About the only mistake -- beyond using multiple queues for the same work items -- is to try and regulate them.

    Other than tailoring their bounds -- the size of the queue. This is usually done very pragmatically, by trying a few different sizes on small test sets to find what works best. It can be done more scientifically, but the instrumentation required is not fullness. It is: 'time spent blocking when full' versus 'time spent blocking when empty'. (Producer bound or consumer bound.) It is also possible to use these statistics to adjust the size automatically, but it requires that the statistics be measured and evaluated internally (ie. under locks). The downside is that gathering the statistics imposes a considerable penalty.
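
    As a sketch of that instrumentation -- the wrappers and tallies are illustrative; the timings include call overhead rather than pure block time, and a real version would keep the tallies in shared memory under a lock:

        use strict;
        use warnings;
        use Thread::Queue;
        use Time::HiRes qw(time);

        my $q = Thread::Queue->new;
        $q->limit = 10;    # bounded queue (Thread::Queue 3.x)
        my %blocked = ( full => 0, empty => 0 );

        sub timed_enqueue {                   # producer side
            my $t0 = time;
            $q->enqueue(@_);                  # blocks while the queue is full
            $blocked{full} += time - $t0;     # time spent blocked on full
        }

        sub timed_dequeue {                   # consumer side
            my $t0   = time;
            my $item = $q->dequeue;           # blocks while the queue is empty
            $blocked{empty} += time - $t0;    # time spent blocked on empty
            return $item;
        }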

    A use case for is_full is ... errors, warnings and debugging .... If full, drop debugging ....

    My personal reaction is: if you can safely drop them, why are you producing them. That's not a question. The point is, don't do what you do not have to do. And once you've limited yourself to doing only what you have to do, the option to drop goes away. So then, you are either able to run sufficient consumers to ensure that the producer doesn't block too often; or you need a bigger box; or you need to redesign your system.

    The CompSci solution to your scenario is a priority queue. These are a completely different beast to free-running queues and must be implemented in an entirely different way. Perhaps the simplest implementation is to use multiple, free-running queues, one for each level of priority. Producers queue to the appropriate queue. Consumers try the highest priority queue first and drop down until they get work.
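
    A sketch of that simplest implementation; the number of levels and the back-off interval are arbitrary choices:

        use strict;
        use warnings;
        use Thread::Queue;
        use Time::HiRes qw(usleep);

        my @by_priority = map { Thread::Queue->new } 0 .. 2;    # 0 = highest

        # Consumers sweep from the highest priority down until they get work.
        sub next_item {
            while (1) {
                for my $q (@by_priority) {
                    my $item = $q->dequeue_nb;    # non-blocking dequeue
                    return $item if defined $item;
                }
                usleep 10_000;    # all empty; back off and retry
            }
        }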

    But in reality they do not work well at all. The trouble is that as soon as a consumer has discovered that the high priority queue is empty, it can become non-empty. But the consumer has dropped down to the lowest priority level and is working on an item that is going to take ages, and so the high priority queue starts to go unserviced and fills up. One approach to correcting this is to have the consumers poll the high-priority queue for a short while before dropping down, but it fixes nothing, because Sod's Law tells you that the nanosecond it stops polling is when the high priority job arrives.

    The archetypal example of how priority queues fail is IBM's long-defunct SNA. It had priority levels for transmission packets. The trouble is that if you allow the priority to be user elective, everybody marks their stuff with the highest priority. There is no way to establish actual relative priority -- even if you could have a constant round-table discussion between all parties, they'd never reach an agreement. So, you resort to heuristics. Small means high because it takes no time; big means low because it takes lots of time. At busy times big just never gets through, so then you get people (me, in real life) writing programs that break the huge files I need to send up into zillions of iddy-biddy chunks, so they go through fast and can be reassembled at the other end.

    The only half-good solution is to use dedicated queues for each priority and dedicated consumers for each of those queues, and manage them through OS thread priority settings. It won't stop low-priority traffic from building up, but on any modern OS with dynamic (thread) priorities, low-priority queues that are starved of cpu slowly raise their priority until they get a time-slice before dropping back, thus ensuring some forward progress for all threads (queues). Any other mechanism fails to ensure that forward progress will be made by all priority levels.

    Or there might be several queues I could potentially add an item to, and I wish to choose one that is not full.

    Another rookie mistake I'm afraid.

    It is always a mistake to have multiple paths (queues) by which a work item may make its way to its consumer(s).

    If those multiple queues lead to the same (pool of) consumer(s), then the general effect of having multiples is to ensure that one or more queues will fill and be ignored.

    How will the consumers decide which queue to draw from? Even if they pick randomly, they'll eventually pick the empty one and block while the others are all full. If they do it by fullness, all the consumers will pick the fullest queue at the same time, and the last ones in will find it emptied by those that got in first, and block, despite other queues being non-empty. And the harder you try to avoid these deadlocks, the worse they will get. (Sod's Law again.)

    And if, by some mechanism, you succeed in getting both your producers and consumers to treat the multiple queues between them utterly impartially and fairly and evenly, then all you have done is create the exact effect of having a single longer queue, except FIFO is no longer guaranteed. You've broken the basic tenet of a queue.

    peek/peek_all -- there may be several producers but only one consumer, so there is no danger of peeking at a value only for somebody else to dequeue it.

    Okay. What is that one consumer going to do if it peeks the queue and doesn't like what it sees there? Ignore it until it goes away? Oh. That's right. It is the only consumer, so it won't go away until it consumes it.

    So, now you need the peek_all() so that it can cherry-pick the items it wants. Except how can it get its preferred item(s) out of the queue without removing the top item? So now you need a random-access dq. Hm. That's sounding a lot like an array!

    If a consumer can tell from inspecting a dq'd item that it doesn't want to process it immediately, then *it* has the option of storing it somewhere in *its* unshared memory -- where locking isn't required for further access; where it's not occupying a shared memory slot; where its decisions will have no impact upon other threads; and where it doesn't mean screwing with the simplicity and efficiency of the queue structure.
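
    A sketch of that option. ready_to_process() and process() are hypothetical stand-ins for the consumer's own logic:

        use strict;
        use warnings;
        use threads;
        use Thread::Queue;

        my $q = Thread::Queue->new;

        # Hypothetical stand-ins for the consumer's own logic:
        sub ready_to_process { ... }
        sub process          { ... }

        sub consumer {
            my @deferred;    # this thread's private stash; no locking needed
            while ( defined( my $item = $q->dequeue ) ) {
                if ( ready_to_process($item) ) {
                    process($item);
                }
                else {
                    push @deferred, $item;    # park it in unshared memory
                }
                # Retry anything parked earlier that may now be processable.
                my @keep;
                for my $d (@deferred) {
                    if   ( ready_to_process($d) ) { process($d) }
                    else                          { push @keep, $d }
                }
                @deferred = @keep;
            }
        }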


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?
