Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight

Re: removing the goto

by gaal (Parson)
on Jun 02, 2007 at 07:14 UTC ( [id://618880] : note . print w/replies, xml ) Need Help??

in reply to removing the goto

You already have good solutions, I'll just suggest an approach that works very similarly but is spelled out in different style.

# I'll call @weighteddiv "@input" here, and $maxclients "$choose_count +". { my %selected_set; my $choose_one = sub { $selected_set{ rand(@input) } = 1) }; $choose_one->() while keys %selected_set < $choose_count; my @selected = @input[ keys %selected_set ]; }

This uses a new variable to remember which indexes had already been selected. It doesn't shuffle the input so it's less expensive up front but risks running forever if the random selections are unlucky (or indeed if you ask for more selections than the input contains). In practical terms, this means you should choose this strategy when @input is large and $choose_count is small.

You can of course inline the choose_one function, since it's so simple and only used once. I put it there for clarity.

Note that your rand never selected the last element of the input.

Replies are listed 'Best First'.
Re^2: removing the goto
by TimToady (Parson) on Jun 02, 2007 at 17:00 UTC
    Hmm, that's not quite right. You've not eliminated the chance of duplicate values in the output. You want something more like:
    { my %selected_set; my $choose_one = sub { $selected_set{ @input[rand @input] } = 1) } +; $choose_one->() while keys %selected_set < $choose_count; my @selected = keys %selected_set; }
    The problem of non-termination is indeed something that will bite you when you least expect it. I believe it can only be solved probabilistically in the absence of a complete scan of the input, either by shuffle or by calculating a histogram of the set of input values somehow. In a workflow situation I'd probably try to get the histogram precalculated for me, and then you can actually use the numerical weights to make your selection, since this scales better to large weights than duplicating input.

    So in the absence of that kind of knowledge, I see two ways of reducing the probability of a hang. First way is to use a shuffle for small datasets and random selection for large datasets, where small/large division can be arbitrary, or determined dynamically by scanning the front of the dataset to make sure there are "enough" different values.

    The second probabilistic method is to count how many times you've made a random selection, and give up if the number of attempts far outweighs the number of desired values (and maybe print a warning, so you know why your program now takes five seconds to run instead of five microseconds). But running for five seconds and producing some output is a lot better than running forever and producing no output.

      I was under the misapprehension that the input elements were guaranteed to be distinct. But seeing your reply I reread the OP and saw that was indeed not the case. Oops! In the absence of such a guarantee of course I agree with all your observations. Now, since my suggestion was in the first place more a stylistic one and not an algorithmic one, I'm interested in following up on how to code higher-level stuff that observes the running program and aborts and possibly switches attempted computations, how to do this with the minimum uglification of existing implementations, and reasonably efficiently.

      Are there patterns here that you as language designer can see? For example, in a pure world, it's quite reasonable to say "fork off a thread with strategy #1, and give it $x seconds to run. If it fails emit the warning and move off to strategy #2". This is what you suggested except in a pure language what "fork off a thread" means under the hood kind of doesn't matter. It may be an OS thread, or something else -- you know the side effects are contained, and the computation can be killed at any time. Does it make sense to tag a function with "this should never take more than [some amount of CPU / RAM / other resource]"?

      I also wonder if there's any commonality to expressing "tips" about the data, but I guess there isn't. (Like for example if here whoever gave us the data knew there were duplicates, but only very few, how would they express that.)

      Sorry for rambling a bit, I should get some sleep...