In the previous blog post, we described the size profile problem and one possible solution. Unfortunately, that solution has a subtle difficulty.

Continuing from the example of censoring in the previous blog post, consider the following modification to the size S of the example item:

[Table: weekly sales for Size S, Size M, and Size L over the 7 weeks; the Size S row is 1 x x x x x x.]

In this case, the size S for this item is in extremely low demand and, knowing that, the retailer ordered only one unit of this size. This one unit happened to sell in the first week, so the remaining weeks were stocked out of this size and thus censored. (Here, the “x” in size S denotes that inventory is 0. In the other sizes, “x” continues to mean, as in Part I of the blog, “unreliable demand information due to any cause.”) Applying the principle of using only the available evidence, we would calculate the rate of sale as 1/1, or 1 unit per week, which is a vast overestimate compared to the retailer’s estimate of 1 unit for the entire 7 weeks. Consequently, the maximum likelihood estimate of the sales profile would allocate far too much to size S, and the retailer would likely find this overestimate unacceptable.

The situation is actually worse than this one example shows. Consider the possible sales patterns that could occur for the size S, keeping in mind that the size is stocked out as soon as a single unit sells:

1 x x x x x x
0 1 x x x x x
0 0 1 x x x x
0 0 0 1 x x x
0 0 0 0 1 x x
0 0 0 0 0 1 x
0 0 0 0 0 0 1

In only one of these cases, namely the last one, would we obtain the correct sales rate of 1/7 unit per week. In all other cases, the estimate would be an overestimate. Moreover, assuming demand in each week is independent, the last sales pattern above is the least probable, since the single unit of the item must avoid selling for 6 weeks in a row and then sell in the 7th week. Thus, the most probable scenario is that maximum likelihood will produce an overestimate for the size S, possibly a severe one, leaving the retailer with excess inventory that must be marked down, possibly at a loss.
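The argument above can be checked numerically. The following is a minimal sketch, assuming (this is not stated in the post) that weekly demand for size S is Poisson with rate 1/7 and that exactly one unit is in stock; the function name `first_sale_distribution` is hypothetical.

```python
import math

def first_sale_distribution(total_demand=1.0, n_weeks=7):
    """For each week k, return (k, probability the single unit sells in week k,
    rate of sale estimated from the non-censored weeks only).

    Assumption: weekly demand is Poisson with rate total_demand / n_weeks,
    independent across weeks, and only one unit is in stock.
    """
    weekly_rate = total_demand / n_weeks
    p_no_sale = math.exp(-weekly_rate)       # P(no demand in a given week)
    rows = []
    for week in range(1, n_weeks + 1):
        # No sale in the first (week - 1) weeks, then at least one demand.
        prob = p_no_sale ** (week - 1) * (1.0 - p_no_sale)
        # One unit sold over `week` non-censored weeks.
        estimated_rate = 1.0 / week
        rows.append((week, prob, estimated_rate))
    return rows

for week, prob, estimated_rate in first_sale_distribution():
    print(f"first sale in week {week}: probability {prob:.4f}, "
          f"estimated rate {estimated_rate:.4f} per week")
```

Under these assumptions the probabilities decrease from week 1 to week 7, and only the week-7 pattern recovers the rate 1/7. The probabilities do not sum to 1 because the unit may also fail to sell at all during the season.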

Such an overestimate is much likelier to happen with very low sellers, because a low seller has only a few sales, and it is far more probable for a few sales to all fall early in the season by chance than for many sales to do so. Thus, any method of correcting the overestimation should distinguish between the low-selling sizes and the high-selling ones, so that only the low-selling ones are corrected. However, purely from the historical data, we cannot know which ones need correcting. For example, from the sales pattern of 1 x x x x x x, we cannot claim that the calculated sales rate of 1 per week is too high unless we presuppose that the size would not have sold in the stocked-out weeks had it been available. And from the historical data alone, we have no reason to suppose this.

One additional piece of information we have is the retailer’s estimate for the total sales of size S for the 7 weeks, that is, the total units of size S that the retailer ordered (in this case just 1). We also recognize that the retailer’s estimate could be mistaken, and we can describe how far off it is likely to be with a probability distribution. For example, it is highly unlikely that the true total demand for size S is 100 if the retailer only ordered 1, but quite possible that the total demand is actually 2 and not 1. This thought process leads us to consider Bayesian techniques.

To keep the Bayesian example manageable, we’ll assume as before that the demands d_w are constant, so that their estimation plays no role in the example. As mentioned earlier, the estimation then simply comes down to calculating the average rate of sale for size S over the non-censored weeks. However, now we want to alter the estimation to include the retailer’s estimate of size S’s total demand as a prior. Here, we put a distribution on the total demand I of the size S, with the mean I_0 being the retailer’s estimate, obtained from the total units the retailer bought for size S over the 7 weeks. For the distribution, we choose a gamma distribution, as it is the conjugate prior for the Poisson distribution. The distribution is thus

[Equation image: the gamma density ƒ(I|α), parameterized so that its mean is I_0]

This is the gamma distribution written in a form that makes its mean I_0, with α an unknown parameter. (The gamma distribution has two parameters, but if the mean is known, then only one parameter is left to be specified.) In this notation, the variance is then

[Equation image: the variance of ƒ(I|α) in terms of I_0 and α]
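The description above fixes only the mean I_0 and leaves one free parameter α. One parameterization consistent with it, offered here as an assumption rather than as the exact form in the original equation, takes shape α and rate α/I_0:

```latex
\begin{align*}
f(I \mid \alpha) &= \frac{(\alpha / I_0)^{\alpha}}{\Gamma(\alpha)}\, I^{\alpha - 1}\, e^{-\alpha I / I_0}, \qquad I > 0, \\
\mathrm{E}[I] &= \frac{\alpha}{\alpha / I_0} = I_0, \\
\mathrm{Var}[I] &= \frac{\alpha}{(\alpha / I_0)^{2}} = \frac{I_0^{2}}{\alpha}.
\end{align*}
```

With this choice, a smaller α gives a larger variance around the retailer’s estimate I_0. Other gamma forms with mean I_0 are possible and would change the variance expression.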

The posterior distribution on I is then proportional to

$$L(I)\, f(I \mid \alpha),$$

where

[Equation image: L(I), the product over non-censored weeks w of the Poisson probabilities of the observed sales k_w]

is the Poisson likelihood for the size S that we would have used in the maximum-likelihood estimation above, and k_w are the sales units for size S in non-censored week w (the product over w runs only over non-censored weeks). Thus, the posterior distribution accounts for both the likelihood from the evidence k_w of the non-censored weeks and the prior distribution on I. To actually make L(I)ƒ(I|α) into a probability distribution, we have to normalize it:

$$\frac{L(I)\, f(I \mid \alpha)}{\int_0^\infty L(I')\, f(I' \mid \alpha)\, dI'}$$

Suppose α were known. The last step of the Bayesian approach is to get an actual value for I by taking the expectation of I using the above distribution:

$$\mathrm{E}[I] = \frac{\int_0^\infty I\, L(I)\, f(I \mid \alpha)\, dI}{\int_0^\infty L(I)\, f(I \mid \alpha)\, dI}$$

This value is then our new value for the total demand of size S, replacing the estimate implied by our earlier maximum-likelihood estimate that used L(I) without ƒ(I|α). We can repeat the above Bayesian approach for each size of this item, obtaining the total demand of each of the sizes, and from the total demands then obtain the size profile for the item. One way to view the expectation calculation is that we are taking a weighted average of all the possible values that I could be, where the weight is given by L(I)ƒ(I|α). And just as in any weighted average, we have to divide by the sum of the weights (which is the integral in the denominator).
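As a worked sketch, the expectation can be written in closed form under two assumptions that are not confirmed by the text: the prior has shape α and rate α/I_0, and weekly sales of the size are Poisson with rate I/N, where N is the number of weeks in the season. Here n (the number of non-censored weeks), K (the total units sold in those weeks), and N are symbols introduced for illustration.

```latex
\begin{align*}
L(I) &= \prod_{w} \frac{(I/N)^{k_w} e^{-I/N}}{k_w!} \;\propto\; I^{K} e^{-nI/N},
  \qquad K = \sum_{w} k_w, \\
L(I)\, f(I \mid \alpha) &\propto I^{\alpha + K - 1}\, e^{-(\alpha/I_0 + n/N)\, I}, \\
\mathrm{E}[I] &= \frac{\alpha + K}{\alpha / I_0 + n / N}.
\end{align*}
```

For the size S example (K = 1, n = 1, N = 7, I_0 = 1) this gives (α + 1)/(α + 1/7), which approaches the maximum-likelihood value of 7 as α goes to 0 and the retailer’s estimate of 1 as α grows large.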

The necessity of calculating the integral in the denominator is the reason for choosing ƒ to be the gamma distribution. The integrand L(I)ƒ(I|α) still has the form of a gamma distribution, which is why the integral is expressible in terms of the gamma function. (In fact, the integrand in the numerator also has the form of a gamma distribution, which is why the numerator is also expressible in terms of the gamma function.)
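The role of conjugacy can be illustrated by comparing a closed-form posterior mean with brute-force numerical integration. This is a minimal sketch under assumptions not taken from the post: a gamma prior with shape `alpha` and rate `alpha / i0`, and Poisson weekly sales with rate I / `n_season_weeks`. All function and parameter names are hypothetical.

```python
import math

def posterior_mean_closed_form(alpha, i0, sales, n_season_weeks):
    """Posterior mean of total demand I for one size.

    Assumes a gamma prior (shape alpha, rate alpha / i0) and Poisson weekly
    sales with rate I / n_season_weeks. `sales` holds the units sold in the
    non-censored weeks only.
    """
    k_total = sum(sales)
    n_obs = len(sales)
    return (alpha + k_total) / (alpha / i0 + n_obs / n_season_weeks)

def posterior_mean_numeric(alpha, i0, sales, n_season_weeks,
                           upper=100.0, steps=100_000):
    """Same quantity as a ratio of two midpoint-rule integrals over [0, upper]."""
    k_total = sum(sales)
    n_obs = len(sales)
    rate = alpha / i0 + n_obs / n_season_weeks
    h = upper / steps
    numerator = 0.0
    denominator = 0.0
    for j in range(steps):
        x = (j + 0.5) * h
        # Unnormalized weight L(I) * f(I | alpha); constants cancel in the ratio.
        weight = math.exp((alpha + k_total - 1.0) * math.log(x) - rate * x)
        numerator += x * weight
        denominator += weight
    return numerator / denominator

# Size S example: one unit sold in the single non-censored week of a 7-week season.
closed = posterior_mean_closed_form(2.0, 1.0, [1], 7)
numeric = posterior_mean_numeric(2.0, 1.0, [1], 7)
print(closed, numeric)
```

The two values should agree closely. Without a conjugate prior, only the numerical route would be available, and it would have to be repeated for every size of every item.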

In practice, we would probably apply the above correction procedure only to those sizes which have very low total demand, say where the retailer bought 10 or fewer units for the size for the entire selling season because, as we mentioned earlier, it would typically be low-selling sizes where the maximum-likelihood approach would give an inflated estimate.

The above Bayesian procedure accounts for the evidence specific to every size of every item, even if α were constant over all items and sizes: the likelihood L(I) is based directly on the sales evidence, so every size of every item obtains its own specific estimate for I. The ability to have estimates at the very lowest level is a hallmark of Bayesian techniques, and results from their ability to account for evidence at the very lowest level.

The above presumes that we know the value of α, so we now come to the problem of estimating α from historical data. Here we follow the procedure from page 278 of the paper [FaderHardieLee], and we obtain the likelihood L(k_w|α) for the size S in our example by integrating the likelihood we already used above:

$$L(k_w \mid \alpha) = \int_0^\infty L(I)\, f(I \mid \alpha)\, dI$$

The notation L(k_w|α) expresses the likelihood of the historical sales k_w in terms of the unknown α. Instead of using L(I) alone, which is what our maximum-likelihood estimation of size profiles does, our view here is that we do not have a single value for I but instead a distribution ƒ(I|α). That is why we take a weighted average of L(I) against this distribution.

Now we take the log of L(k_w|α), sum over all of the items to which the estimated value of α will apply, and find the α that maximizes this total log likelihood. In practice, we may estimate a single α for the entire set of items in the group, but once again using only the data for the lowest-selling item-size combinations.
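The estimation step might be sketched as follows. The assumptions are not from the post: a gamma prior with shape `alpha` and rate `alpha / i0`, Poisson weekly sales with rate I / `n_season_weeks`, and a simple grid search in place of a proper optimizer. The function names and the sample records are hypothetical.

```python
import math

def log_marginal_likelihood(alpha, i0, sales, n_season_weeks):
    """log of the integral of L(I) * f(I | alpha) over I, for one item-size.

    `sales` holds units sold in the non-censored weeks only; `i0` is the
    retailer's total-demand estimate (units bought) for that item-size.
    """
    k_total = sum(sales)
    n_obs = len(sales)
    prior_rate = alpha / i0
    post_rate = prior_rate + n_obs / n_season_weeks
    return (
        alpha * math.log(prior_rate) - math.lgamma(alpha)
        + math.lgamma(alpha + k_total) - (alpha + k_total) * math.log(post_rate)
        - k_total * math.log(n_season_weeks)
        - sum(math.lgamma(k + 1) for k in sales)   # log of the k_w! terms
    )

def estimate_alpha(records, n_season_weeks, grid=None):
    """Return the grid value of alpha maximizing the total log likelihood.

    `records` is a list of (i0, sales) pairs, one per low-selling item-size.
    """
    if grid is None:
        # Log-spaced grid from 0.01 to 1000.
        grid = [10 ** (e / 20.0) for e in range(-40, 61)]

    def total_log_likelihood(alpha):
        return sum(log_marginal_likelihood(alpha, i0, sales, n_season_weeks)
                   for i0, sales in records)

    return max(grid, key=total_log_likelihood)

# Made-up records for illustration: (units bought, sales in non-censored weeks).
sample_records = [(1.0, [1]), (2.0, [0, 1, 0, 1]), (3.0, [0, 0, 1, 0, 0, 2])]
print(estimate_alpha(sample_records, n_season_weeks=7))
```

A grid search is used here only to keep the sketch dependency-free; a one-dimensional numerical optimizer would serve equally well.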

The estimated α is one measure of how accurate the retailer’s assessment of the total demand for low-selling sizes is. A smaller α means the variance is larger, indicating less accuracy in the retailer’s assessment of the total demand. It may be possible to translate this into terms the retailer finds useful, so α may provide independent business value aside from its use in improving size profile estimation for low-selling sizes.

The paper [FaderHardieLee] employs these same Bayesian techniques (and the Poisson distribution) in a much more elaborate form and is instructive reading. The size profile problem, however, is simple enough that less elaborate Bayesian techniques are sufficient.


[FaderHardieLee] Fader, Peter S., Bruce G.S. Hardie, and Ka Lok Lee, “‘Counting Your Customers’ the Easy Way: An Alternative to the Pareto/NBD Model,” Marketing Science 24(2), Spring 2005, pp. 275-284.

Su-Ming Wu

Su-Ming is a Senior Principal Data Scientist in Oracle’s Retail Global Business Unit. He has a Ph.D. in mathematics and an MS in computer science, and has been working on retail-science problems for 7 years.