PerlMonks
Average Price Algorithm

by camelcom (Sexton)
on Jan 28, 2009 at 08:51 UTC

camelcom has asked for the wisdom of the Perl Monks concerning the following question:

Most esteemed monks...

I humbly ask for guidance in the following matter...

I buy some commodity at the following prices:

Qty @ Price
5 @ 93.8
20 @ 93.81
10 @ 93.82
15 @ 93.83
25 @ 93.84
5 @ 93.85
20 @ 93.87
5 @ 94
35 @ 94.1
10 @ 94.2

The average of these (150) fragments is 93.92666667.

These fragments need to be allocated to 5 people in the following quantities:

Person/Qty
A/65
B/12
C/24
D/19
E/30

How do I allocate the fragments as fairly as possible such that the average price for each person is closest to the overall average?

Replies are listed 'Best First'.
Re: Average Price Algorithm
by moritz (Cardinal) on Jan 28, 2009 at 10:01 UTC

      A similar problem is addressed in Higher-Order Perl (Chapter 1, page 35).

      citromatik

      This does indeed sound like an optimization problem on the order of NP-Complete.

      The standard NP-Complete routine I use (when lacking a problem or domain specific optimization) is a recursive solution that tries to cover all combinatorial possibilities, but will short circuit (prune the tree) in the case where the fitness function would never succeed from that point on.

      The key functions I define are apply_move() (applies the move to a current state), fitness() (tests the current state) and next_moves() (gets the set of valid moves). The first time I generalized this function was in a game theory course. An assumption is that you want to have your fitness function give an ordered rating from 0..MAXVAL (0 means total failure, MAXVAL means the best you can get). And yes, the tool I use for this has been heavily tainted by lisp (scheme, actually).

      sub find_solution {
          my ($state, $fitness_fcn, $nextmoves_fcn, $applymove_fcn) = @_;

          return () unless $fitness_fcn->($state);

          my @next_moves = $nextmoves_fcn->($state);
          return $state unless @next_moves;

          return map  { find_solution($applymove_fcn->($state, $_),
                                      $fitness_fcn, $nextmoves_fcn, $applymove_fcn) }
                 grep { $fitness_fcn->($applymove_fcn->($state, $_)) }
                 @next_moves;
      }

      * Code is provided with no warranty, is written from memory, and has not been tested.

      Depending on the complexity of $applymove_fcn, it may make sense to apply a Schwartzian Transform (ST) to this as well, but for understanding the code it is not necessary.

      If you have a time constraint, you could even sort the results of each level so that you try the most fit (to that point) solutions first.

      For your specific problem, the fitness test could rank the solutions as the inverse total deviation from the average or something similar (making a perfect solution 1/0 :-). The larger the fitness test result, the better the solution (for some definition of better). If you want to limit the overage or underage of any specific customer, you would rank the fitness of those solutions exceeding these limits as 0. Most of the tweaking for a faster method is then done in the fitness function. The faster you can prune a branch of the result tree, the faster you will find your solution.
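      As an illustration, here is a minimal sketch of such a fitness function, assuming each bucket is a hashref carrying a running item count and price sum (the names are my own, not from any posted code):

      # Hypothetical fitness: the inverse of the total absolute deviation of
      # each bucket's average from the sample mean.  If any bucket breaches a
      # hard over/under limit the whole solution scores 0, so that branch can
      # be pruned immediately.
      sub fitness {
          my ($buckets, $mean, $max_dev) = @_;
          my $total = 0;
          for my $b (@$buckets) {
              next unless $b->{items};            # empty bucket: no deviation yet
              my $dev = abs($b->{sum} / $b->{items} - $mean);
              return 0 if defined $max_dev && $dev > $max_dev;
              $total += $dev;
          }
          return $total ? 1 / $total : 9**9**9;   # perfect solution -> "1/0"
      }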

      But this is just one solution. If there is a domain or problem specific solution, that will, more than likely, give a faster solution or solutions.

      --MidLifeXis

Re: Average Price Algorithm
by Corion (Patriarch) on Jan 28, 2009 at 08:58 UTC

    If this is a homework problem, please show us the code (or ideas) you have so far.

    If this is a business problem, like for example, allocating commodities or equities to bidders in a market, your market (likely, some kind of exchange) most likely already has defined fixed rules on how the allocation has to happen. It's best to use these rules. If your market does not have such rules, look at the different allocation rules of different yet similar markets and adopt one of those.

      I stopped doing homework many years ago :)

      This kind of problem doesn't have any rules. There will be an optimum solution for each scenario. The trick is being able to search the possible result space in a reasonable timeframe.

      The problem I'm having is in generating all the possible combinations, because the optimum solution for each person may involve some fragments from all price levels or fragments from just a few price levels.

      I'm thinking that a recursive approach might work, but this algorithm is a bit trickier to devise than factorial, etc.!

        Do you require the optimum solution? Is this a real world problem (if so, what are the real world constraints)? What constitutes a "reasonable time frame"? Is the number and size of commodity packets representative and is the number of distribution packets representative?


        Perl's payment curve coincides with its learning curve.
Re: Average Price Algorithm
by weismat (Friar) on Jan 28, 2009 at 09:32 UTC
    This is a fairly difficult discrete linear optimisation problem. I think that writing the solution in Perl is just the tip of the iceberg for this problem.
      Yes, it's a difficult problem.  Where's Abigail when you need him :)
Re: Average Price Algorithm
by ELISHEVA (Prior) on Jan 29, 2009 at 12:45 UTC

    Note: Significantly updated, 2009-02-01, 12:30am ish

    Mathematically this is a bit different from a packing problem. A packing problem is a least upper/greatest lower bound problem. This problem centers on distributions and deviations. It might be better to think of it as a matching problem.

    To keep each bucket closest to the mean, whenever you give a bucket something priced above the mean you also have to give it something priced below the mean. Originally, I thought this would be best accomplished with a two-phase algorithm: first get rid of the items that let every bucket have the same average as the overall sample mean, then get rid of the oddballs. That algorithm is described here:

    I've since reconsidered my algorithm, see note below. Best, beth

      Limbic~Region suggested I spend time comparing that algorithm with his (above), and after consideration I think there is a much better algorithm than the one I proposed.

      I'm nearly sure Limbic~Region's algorithm is provably optimal [update: provided it is adjusted to take into account relative factorization - see below]. It ensures that the largest deviations always end up in the largest buckets, where they are least likely to have a significant impact. His algorithm also uses the values with the smallest deviation from the mean first. This ensures that a bucket whose size prevents an above-mean deviation from being matched with a below-mean deviation will always have the smallest possible unmatched deviation, and hence the bucket average closest to the sample mean.

      On the other hand, the algorithm proposed by Limbic~Region does not take full advantage of the histographic information, so it needs to do a binary search for each smallest-deviation item, not to mention a lot of testing for best fit. This, I think, can be avoided by ranking values by their distance from the mean before we begin filling the buckets.
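      A sketch of that pre-ranking step, using the histogram from the original question (the variable names are my own):

      use strict;
      use warnings;

      # Item histogram (price => quantity) and its mean, from the question.
      my %count_at_price = (
          93.80 => 5,  93.81 => 20, 93.82 => 10, 93.83 => 15, 93.84 => 25,
          93.85 => 5,  93.87 => 20, 94.00 => 5,  94.10 => 35, 94.20 => 10,
      );
      my ($n, $sum) = (0, 0);
      while (my ($price, $qty) = each %count_at_price) {
          $n   += $qty;
          $sum += $price * $qty;
      }
      my $mean = $sum / $n;

      # Rank the distinct prices by absolute distance from the mean so that
      # the smallest deviations are handed out first.
      my @ranked = sort { abs($a - $mean) <=> abs($b - $mean) } keys %count_at_price;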

      The code for a more histographic-aware version of Limbic~Region's algorithm is posted below (without correction for relative factorization). The code has been tested on Debian Etch/Perl 5.8. For demo purposes it prints out the results of allocating 6000 items into 3 buckets for three distributions: (a) a spike with all values at the mean, (b) a symmetrical distribution, i.e. no skew, and (c) a seriously skewed distribution. I've also included a demo run for the original poster's distribution.

      BTW, the algorithm below is O(N), where N is the number of items to allocate to buckets. If I'm right about Limbic~Region's strategy being provably optimal, then this problem is very far from NP-complete.

        The test sample:

        demoAllocation(
            "Distribution: nasty",
            { a => 30, b => 30 },
            { '1.0' => 40, '2.0' => 10, '4.0' => 10 },
        );

        Prints:

        Distribution: nasty
        a: 17 @ $1.00   10 @ $2.00   3 @ $4.00
           bucket avg: $1.63, deviation: $-0.033
        b: 23 @ $1.00   7 @ $4.00
           bucket avg: $1.70, deviation: $0.033

        However, inspection of the original data suggests that simply allocating half of each original parcel to the two groups gives a perfect result.


        Perl's payment curve coincides with its learning curve.
        ELISHEVA,
        Thank you for taking the time to code this up. I agree, and pointed out, that there were optimizations to be made to my approach, as I did it hastily. I am not at all convinced that it produces optimal results for all data sets. I think it is a good heuristic approach given the requirements (a runtime of less than 3 seconds, with a small margin of error).

        To show that it doesn't always produce optimal results, I don't need the rigor of a formal proof; I just need to come up with a counterexample. If you are interested in such an example, it shouldn't be too difficult: a small enough data set allows enumerating all possible partitions and combinations, which can then be compared against the results of the heuristic algorithm. If you want, I can code that up.

        Cheers - L~R

      I was sufficiently frustrated by the discussion in the note above that I decided to just go ahead and correct the algorithm for situations where there are same-sized buckets. The code posted below solves Grandfather's "Nasty distribution" perfectly. It is still O(N), where N is the number of items to allocate.

      The algorithm below is a hybrid of the one proposed by Limbic~Region and the one I originally proposed. Smallest buckets are filled first (as per Limbic~Region), but when we encounter N equal-sized buckets we allocate items N at a time, one per bucket, until we can't; then we go back to allocating items to buckets one by one. This prevents the first of the N buckets from hogging all the "good" values.
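      A rough sketch of that round-robin tweak (this is not the posted code; the names and the tiny two-bucket example are my own):

      use strict;
      use warnings;

      # Two equal-sized buckets; @items is already ranked by distance from its
      # mean of 2.33, smallest deviation first.
      my @items   = (2.0, 2.0, 1.0, 1.0, 4.0, 4.0);
      my @buckets = map { { name => $_, want => 3, got => 0, alloc => [] } } qw(a b);

      # Deal the ranked items out one per bucket while several remaining
      # buckets share the same size, instead of letting the first bucket
      # hog all the "good" values.
      my @same_size = grep { $_->{want} == $buckets[0]{want} } @buckets;
      while (@items and grep { $_->{got} < $_->{want} } @same_size) {
          for my $b (@same_size) {
              last unless @items;
              next if $b->{got} >= $b->{want};
              push @{ $b->{alloc} }, shift @items;   # take the next-ranked price
              $b->{got}++;
          }
      }
      print "$_->{name}: @{ $_->{alloc} }\n" for @buckets;   # a: 2 1 4   b: 2 1 4

      Both buckets end up averaging 2.33, which is exactly the sample mean.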

      I'd still very much appreciate it if someone could find an example or two that doesn't work with this algorithm. Though I'm not convinced that this problem is anything like NP complete, I'm sure the code I've posted has room for improvement.

      Best, beth

      Update 1: fixed bug in code that was originally posted.

      Update 2: (Feb 3, 8:00 UTC) fixed the bucket count bug identified below by Grandfather and Limbic~Region. I also made the output scream **ERROR** if this kind of mistake happens again. This is only a bug fix, however: the bucket counts are now right, but the distributions discussed below ("Nasty mark II (Grandfather)" and "Limbic~Region") are still suboptimal.

        demoAllocation(
            "Distribution: skewed: Nasty mark II (Grandfather)",
            { a => 30, b => 20, c => 10 },
            { '1.0' => 30, '2.0' => 12, '4.0' => 6, '8.0' => 12 },
        );

        Prints:

        Distribution: skewed: Nasty mark II (Grandfather)
        a: 22 @ $1.00   8 @ $8.00
           bucket avg: $2.87, deviation: $-0.033
        b: 8 @ $1.00   6 @ $2.00   2 @ $4.00   4 @ $8.00
           bucket avg: $3.00, deviation: $0.100
        c: 6 @ $2.00   4 @ $4.00
           bucket avg: $2.80, deviation: $-0.100

        However, allocating 1/2 of each parcel to a, 1/3 to b and 1/6 to c gives a perfect result.


        Perl's payment curve coincides with its learning curve.
        Many thanks again to Grandfather and Limbic~Region! Both for the bug find and the sub-optimal examples.

        Both examples are food for thought. Limbic~Region's distribution is interesting because it is a good example of where looking ahead only to the next largest deviation does not yield an optimal result. The smallest-sized buckets can only be optimally filled by ignoring 6.0, even though it is smack on the mean (a deviation of 0). Grandfather's "nasty mark II" shows clearly that least (greatest?) common denominators (i.e. 2), and not just individual bucket size or frequency of bucket size, affect the result.

        Best, beth
        ELISHEVA,
        demoAllocation(
            "Distribution: Limbic~Region",
            { a => 3, b => 4, c => 2, d => 2 },
            { '1.0' => 1, '2.0' => 1, '3.0' => 1, '4.0'  => 1, '5.0'  => 1, '6.0' => 1,
              '7.0' => 1, '8.0' => 1, '9.0' => 1, '10.0' => 1, '11.0' => 1 },
        );
        __DATA__
        Distribution: Limbic~Region
        a: 1 @ $2.00   1 @ $3.00   1 @ $9.00
           bucket avg: $4.67, deviation: $-1.333
        b: 1 @ $1.00   1 @ $10.00   1 @ $11.00
           bucket avg: $7.33, deviation: $1.333
        c: 1 @ $5.00   1 @ $6.00   1 @ $7.00
           bucket avg: $6.00, deviation: $0.000
        d: 1 @ $4.00   1 @ $8.00
           bucket avg: $6.00, deviation: $0.000

        It is easy to show that this is not the best result. Consider a perfect distribution:

        • 1, 11, 6
        • 2, 10, 3, 9
        • 7, 5
        • 8, 4

        Cheers - L~R

      Sorry - firefox double posted this.
Re: Average Price Algorithm
by Limbic~Region (Chancellor) on Jan 29, 2009 at 01:55 UTC
    camelcom,
    The following heuristic approach does amazingly well with the given data set. It can be improved further by not calculating the last fragment since whatever is left must go in it. I left it the way it is in case it was possible that the total of all fragments was less than the total quantity.

    The algorithm works as follows:

    • Fill the fragments (one per person) in order from the one needing the fewest items to the one needing the most
    • For each item, choose the price that brings the fragment's running average closest to the target average

    The binary search could be improved and some of the math is duplicated, so there are speed improvements to be had. There may also be bugs to be squashed, as I wrote it in a hurry.
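    Since the originally posted code is not reproduced here, the following is only my own reconstruction of the approach described above, using the question's data (all names are mine): fill the smallest allocation first, and for each unit pick the remaining price that keeps the running average closest to the overall average.

    use strict;
    use warnings;

    # Supply (price => quantity) and requested quantities from the question.
    my %supply = (
        93.80 => 5,  93.81 => 20, 93.82 => 10, 93.83 => 15, 93.84 => 25,
        93.85 => 5,  93.87 => 20, 94.00 => 5,  94.10 => 35, 94.20 => 10,
    );
    my %want = (A => 65, B => 12, C => 24, D => 19, E => 30);

    # Overall target average (93.92666... for this data).
    my ($units, $value) = (0, 0);
    while (my ($price, $qty) = each %supply) {
        $units += $qty;
        $value += $price * $qty;
    }
    my $target = $value / $units;

    # Smallest allocations first; each unit takes whichever remaining price
    # pulls that person's running average closest to the target.
    my %alloc;
    for my $person (sort { $want{$a} <=> $want{$b} } keys %want) {
        my ($got, $sum) = (0, 0);
        while ($got < $want{$person}) {
            my ($best, $best_err);
            for my $price (grep { $supply{$_} } keys %supply) {
                my $err = abs(($sum + $price) / ($got + 1) - $target);
                ($best, $best_err) = ($price, $err)
                    if !defined $best_err or $err < $best_err;
            }
            $supply{$best}--;
            $alloc{$person}{$best}++;
            $sum += $best;
            $got++;
        }
        printf "%s: %d items, avg %.5f (deviation %+.6f)\n",
            $person, $got, $sum / $got, $sum / $got - $target;
    }

    Being greedy, it commits to each unit without look-ahead, which is exactly where the sub-optimal cases discussed elsewhere in this thread come from.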

      If you add the line:

      ++$fragment{$frag}{$val};

      following ++$fragment{$frag}{items}; in the innermost loop in the main block you record the quantities at each price that were used to achieve each person's package.

      Oh, small nit (that may be covered by "I left it the way it is ...")- you didn't allocate anything to E because it wasn't included in %fragment.


      Perl's payment curve coincides with its learning curve.
      Thanks Limbic, but I'm not sure this approach will find the best solution. I know that, for the given example, there is a selection which gives a maximum error in all buckets of 0.0001754 (3 bucket averages being exactly the average).

      Although this approach looks correct, often the best solution will be some blend of fragments which this methodology will not find. I'm thinking some kind of brute-force method may be even better, but still working on options.

        (posted here instead of one level up because it seems to expand on the thoughts of the parent)

        ++ to Limbic~Region for the algorithm (grandparent node). I like it.

        It does appear, however, that it is minimizing the error once for each individual fragment, rather than minimizing the error for some summation function (absolute error, statistical deviation, etc) over the entire population.

        As a "good" solution, I think that this is excellent, and it seems to be similar to a "good" solution to the bin packing problem that I remember from my undergrad days. It can be fooled with some nasty data.

        Like any other NP-hard optimization problem, an approximation's "good enough" value is in the eye of the beholder, as is whether you are willing to calculate (or wait for) the optimum solution.

        You could add an optimization loop to the end of this algorithm, bounded by whatever time you are willing to wait (or some other limiting factor), and start swapping values between the bins (picking which swaps will give the best change is left as an exercise - perhaps swap something between the best and the worst, or the two worst, or the worst over and the worst under), recalculating the fitness, saving the state if it is better, and repeating while you still have cycles left.
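        As a sketch of such a pass (names and structure are my own assumptions, not the parent's code): it assumes %alloc maps person => { price => quantity }, e.g. from a first-pass heuristic, and $target is the overall average; swap candidates here are simply picked at random.

        use strict;
        use warnings;
        use Time::HiRes qw(time);

        # Total absolute deviation of every bucket's average from the target.
        sub total_error {
            my ($alloc, $target) = @_;
            my $err = 0;
            for my $bucket (values %$alloc) {
                my ($n, $sum) = (0, 0);
                for my $price (keys %$bucket) {
                    $n   += $bucket->{$price};
                    $sum += $price * $bucket->{$price};
                }
                $err += abs($sum / $n - $target) if $n;
            }
            return $err;
        }

        # Randomly trade single units between two buckets, keeping a swap only
        # if it lowers the total error, until the time budget runs out.
        sub refine {
            my ($alloc, $target, $seconds) = @_;
            my @people   = keys %$alloc;
            my $best_err = total_error($alloc, $target);
            my $deadline = time() + $seconds;
            while (time() < $deadline) {
                my ($p1, $p2) = @people[int rand @people, int rand @people];
                next if $p1 eq $p2;
                my @from = grep { $alloc->{$p1}{$_} } keys %{ $alloc->{$p1} };
                my @to   = grep { $alloc->{$p2}{$_} } keys %{ $alloc->{$p2} };
                next unless @from and @to;
                my ($f, $t) = ($from[int rand @from], $to[int rand @to]);
                next if $f == $t;
                $alloc->{$p1}{$f}--; $alloc->{$p1}{$t}++;    # tentative swap
                $alloc->{$p2}{$t}--; $alloc->{$p2}{$f}++;
                my $err = total_error($alloc, $target);
                if ($err < $best_err) {
                    $best_err = $err;                        # keep the improvement
                }
                else {                                       # undo the swap
                    $alloc->{$p1}{$f}++; $alloc->{$p1}{$t}--;
                    $alloc->{$p2}{$t}++; $alloc->{$p2}{$f}--;
                }
            }
            return $best_err;
        }

        A call like refine(\%alloc, $target, 3) would then cap the improvement pass at roughly three seconds.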

        --MidLifeXis

        camelcom,
        This is a heuristic solution, and the problem appears NP-hard/complete. That means it is not guaranteed to produce the best solution for all data sets, which is why I said it did amazingly well but didn't say perfect. You indicated you had an acceptable margin of error, and it looked like this could be made to fit your criteria. It is also obviously fast and can be improved.

        Cheers - L~R

      I have checked the Dominus book. The partitioning problem shown there is solved recursively, but it's slightly different from this problem in that (with this problem) you only know the quality of the fit once the overall selection for a bucket is made. Often there is some blend of allocations which fits really well, but it's easy to miss those if you get too clever with the algorithm.

      As for the measurement of fitness - I'm having trouble deciding, but I think the sum of the errors is reasonable. Certainly not the max error because that will most often occur for allocations which are very small (1,2 or 3).

      With monks help, I think I'm starting to get a better handle on this problem - I have an idea for a new approach based on a combination of a heuristic and brute-force.

      I work in an organisation which is (supposedly) stuffed full of smart people, but when I get stuck I always come to the monks first :)

        Sum of absolute errors, I think, is what you mean - otherwise your fitness function would always return 0 ;-)

        You may be better off using one of the statistical error calculations (Portal:Statistics). IIRC (and stats was not my strong suit 15+ years ago, so apply salt as necessary), some of the more complex (well, more complex than summation anyway) error functions would give "better" results.

        --MidLifeXis

Re: Average Price Algorithm
by blokhead (Monsignor) on Jan 29, 2009 at 17:47 UTC
    I agree with the hunches of my esteemed colleagues -- this problem certainly smells NP-complete. Of course, the details depend on the generalization of the problem, the range of values, and how the objective function is defined (how do you measure "closeness" of a set of numbers to the average?) But since it's probably NP-complete, one would imagine looking for some reasonable heuristics / approximations.

    I propose the following mind-numbingly simple algorithm:

    1. If a person wants N items, then just allocate a random N items to the person.
    Sounds like a joke, but think about it. From each person's point of view, they get a random sample of the available items. That person's "score" is the mean of their sample. But the expected value of a sample mean is the overall mean. So at least in expectation, each person's allocation tends to the overall mean.

    Next, the standard approach for "refining" an algorithm that only has good expected performance is:

    2. Run many trials of #1 and take the best one
    To formally analyze the behavior of the repeated-trials approach, you'd have to do some more statistical analysis that incorporates variances as well (to know how likely each trial is to get close to the objective). I can't make that back-of-the-envelope statistical calculation now, but at least in the sample you gave, the variance among the items seems very small. Thus the variances of the sample means are also small. So I would expect this algorithm to do fairly well. This makes sense, because the difficulty of the problem seems related to how varied the individual items' scores are -- the distribution of the sample mean is greatly influenced by outliers, for example.
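    A quick sketch of both steps on the question's data (the names and the 10,000-trial budget are my own choices):

    use strict;
    use warnings;
    use List::Util qw(shuffle);

    my %supply = (
        93.80 => 5,  93.81 => 20, 93.82 => 10, 93.83 => 15, 93.84 => 25,
        93.85 => 5,  93.87 => 20, 94.00 => 5,  94.10 => 35, 94.20 => 10,
    );
    my %want = (A => 65, B => 12, C => 24, D => 19, E => 30);

    # Flatten the supply into 150 individual prices and compute the mean.
    my @units = map { ($_) x $supply{$_} } keys %supply;
    my $mean  = 0;
    $mean += $_ for @units;
    $mean /= @units;

    # Repeated trials: shuffle, deal out the requested quantities, score by the
    # total absolute deviation of each person's average from the mean, and keep
    # the best allocation seen so far.
    my ($best_err, %best_alloc);
    for my $trial (1 .. 10_000) {
        my @deck = shuffle @units;
        my %alloc;
        my $err = 0;
        for my $person (keys %want) {
            my @sample = splice @deck, 0, $want{$person};
            my $sum = 0;
            $sum += $_ for @sample;
            $alloc{$person} = \@sample;
            $err += abs($sum / @sample - $mean);
        }
        ($best_err, %best_alloc) = ($err, %alloc)    # %best_alloc holds the winning deal
            if !defined $best_err or $err < $best_err;
    }
    printf "best total deviation after 10000 trials: %.6f\n", $best_err;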

    blokhead

Re: Average Price Algorithm
by Bloodnok (Vicar) on Jan 28, 2009 at 09:00 UTC
    Is it just me, or is there the distinct smell of homework in this node ??

    There's no indication of any effort on the part of the poster thus far - apart from a calculation of the average price - a task which is, to say the least, trivial in the extreme.

    A user level that continues to overstate my experience :-))
      The problem as given is not well-defined, which is a pretty clear sign that it is not homework. It is not well-defined because it is unclear what the metric is for the "best" fit. Are we trying to minimize the maximum error? Are we trying to minimize the average error? Are we trying to minimize the sum of the squares of the errors? All three would fit the problem description, and all three can lead to different optimal answers.
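      For illustration, the three candidate metrics side by side (the @dev values are made-up per-person deviations):

      use strict;
      use warnings;
      use List::Util qw(max sum);

      my @dev = (-0.03, 0.01, 0.00, 0.10, -0.08);       # bucket average minus overall mean

      my $max_err    = max map { abs $_ } @dev;          # minimise the maximum error
      my $mean_err   = sum(map { abs $_ } @dev) / @dev;  # minimise the average error
      my $sum_sq_err = sum map { $_ ** 2 } @dev;         # minimise the sum of squared errors

      Each of the three can pick a different winner, which is the point being made above.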

      The fact that the problem looks NP complete is just icing on the "probably not homework" cake.

      Sounds more like he's trying to hand out shares of Apple stock as awards. (I hope they're not back-dating them again) That's the only good reason I can think of to need to track the original prices. However, use of the term "commodity" is completely wrong as that denotes an asset that has no distinguishing characteristics from unit to unit. If it were truly a commodity, the answer would be "charge everyone the average price and it doesn't matter which units they get".
