As fashion retailers look to reduce inventory levels and avoid excessive markdowns, precise forecasting of sales at the style and size level becomes critically important. However, given the speed at which retailers must make informed decisions, accurately forecasting sales at this level of granularity is difficult to manage without applying data science techniques. One possible approach is to forecast at the item level (either style or color), and then spread the forecast down to the item’s sizes. However, the spread-down requires predicting the fraction of sales that will go to each size. This set of fractions is called the size profile of the item, and estimating the profile from the item’s historical sales is the size profile problem.
While the distribution of sizes in the population changes only very slowly over time, size profiles at a retailer may change much more quickly due to the mix of consumers who are attracted to the retailer, and the mix of consumers who are attracted to particular classes of merchandise at the retailer. It is possible that changes in the merchandise in the class will attract a different mix of sizes of people—in fact, at many retailers such merchandise changes can happen quarterly. For this reason, frequent re-estimation of size profiles is a necessity, and such retailers are good candidates for using data-science-based software to perform the re-estimation. This blog post will give examples of how Oracle Retail’s Size Profile science service, available as part of Oracle Retail’s Science Platform Cloud Service (April 2019), makes use of some classic machine learning techniques to estimate size profiles from a retailer’s historical sales data.
A note on the scope of the problem we will consider in this post: clothing customers certainly might purchase a nearby size if their preferred size is out of stock, thus possibly inflating sales of those nearby sizes. Retailers using Oracle Retail’s Size Profile science service assume that such substitution effects are small, and so are not worth the trouble to account for. (The substitution effects may be even smaller once returns are accounted for, since customers not purchasing their preferred size are much more likely to return their purchases.) In addition, while modeling such transference could be done with cross-effect parameters, cross effects are notoriously difficult to estimate. For all these reasons, we have omitted the modeling of such cross effects, and will not be discussing them further in this blog post.
An example from summer apparel
Suppose in a retailer’s historical data, an item sold for 7 weeks at one store during the previous summer season, with the following sales units for each size:
| | Week 1 | Week 2 | Week 3 | Week 4 | Week 5 | Week 6 | Week 7 | Total |
|---|---|---|---|---|---|---|---|---|
| Size S | 1 | 0 | 3 | 0 | 4 | 3 | 0 | 11 |
| Size M | 8 | 8 | 11 | 7 | 6 | 3 | 3 | 46 |
| Size L | 4 | 2 | 7 | 6 | 6 | 2 | 3 | 32 |
| Total | | | | | | | | 89 |
For the upcoming summer season, we want to estimate the size profile for the item at this store based on its history. In this case, we need no data science more sophisticated than some basic arithmetic to solve the problem, and our prediction for the size profile is simply (11/89, 46/89, 32/89). However, sales history is never so neat and simple, and typically we face the problem of “censored demand,” meaning we don’t have reliable information for some of the cells in the above table:
| | Week 1 | Week 2 | Week 3 | Week 4 | Week 5 | Week 6 | Week 7 | Total |
|---|---|---|---|---|---|---|---|---|
| Size S | 1 | 0 | x | 0 | 4 | x | 0 | ? |
| Size M | 8 | 8 | 11 | 7 | 6 | x | x | ? |
| Size L | x | 2 | x | 6 | 6 | 2 | 3 | ? |
| Total | | | | | | | | ? |
The “x” denotes weeks where the size was not fully available for sale because its inventory reached 0 during the week, so we do not know what the size’s true demand would have been for the week. Or, for some other reason, reliable data for the size-week combination is unavailable. In any event, we can no longer just sum up the sales of each size, particularly if many cells have x’s. It is no longer a basic arithmetic problem.
One solution here is to model the demand as Poisson in each cell, and then apply maximum likelihood to estimate the size profile. In this approach, we introduce three variables, Sp_S, Sp_M, Sp_L, for the size profile, and seven variables, d_1 through d_7, for the total demand of the item during each of the weeks above. The Poisson rate in the cell for size m and week w is then Sp_m · d_w, and maximizing the likelihood function based on the above historical data yields estimates for these variables. Choosing the Poisson distribution makes this maximization quite fast, an important consideration when the task is to determine size profiles for the hundreds of thousands of items a fashion retailer might have.
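A minimal sketch of this estimation on the censored example, using scipy's general-purpose optimizer rather than whatever specialized solver the production service uses (the variable names and setup are ours, purely for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# Censored sales table from the example; np.nan marks the "x" (censored) cells.
# Rows: sizes S, M, L; columns: weeks 1..7.
sales = np.array([
    [1, 0, np.nan, 0, 4, np.nan, 0],
    [8, 8, 11, 7, 6, np.nan, np.nan],
    [np.nan, 2, np.nan, 6, 6, 2, 3],
], dtype=float)
observed = ~np.isnan(sales)

def neg_log_likelihood(theta):
    # Optimize in log space so all rates stay positive.
    sp = np.exp(theta[:3])   # size-profile parameters Sp_S, Sp_M, Sp_L
    d = np.exp(theta[3:])    # weekly demand parameters d_1 .. d_7
    rate = np.outer(sp, d)   # Poisson rate Sp_m * d_w in each cell
    # Censored cells contribute no factors to the likelihood, so they
    # are simply excluded; the constant log(s!) term is dropped.
    return float((rate[observed] - sales[observed] * np.log(rate[observed])).sum())

result = minimize(neg_log_likelihood, np.zeros(10), method="L-BFGS-B")
sp = np.exp(result.x[:3])
size_profile = sp / sp.sum()  # sp and d trade off by a scale factor, so normalize
print(dict(zip("SML", size_profile)))
```

Note the normalization at the end: the model only determines Sp and d up to a common scale factor, so we convert the Sp estimates into fractions summing to 1.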
Note that the likelihood function does not include factors for the censored cells, and thus the estimation is based entirely on the cells with non-censored demand. The estimation uses only the available evidence, and the censored cells are excluded because they provide no evidence. (In the case where, for example, the inventory went to 0 during the week instead of at the beginning of the week, we may have partial evidence for the demand in the cell, but for this discussion we will ignore techniques to handle such partial evidence.)
After the estimation, we throw away the d_w estimates, and retain only the size profile estimates. However, the d_w are essential during the estimation, because total demand for the item can vary significantly across the weeks due to time-related effects or actions by the retailer such as promotions or markdowns. Such changes in demand affect each size equally, and so it is unnecessary to have a demand variable per size-week combination—a single d_w for the entire week is sufficient. For example, retailers do not change prices or run promotions on individual sizes.
In practice, we group similar items together, and obtain a single size profile for the entire group of items. In the estimation, this means a single set of Sp variables across the entire group, but the d_w variables are specific to each item because the demand effects, such as promotions, might not be the same for all items.
Instead of the Poisson model of demand, suppose we use the more usual multiplicative model for demand, which in this case would simply be S_{m,w} = Sp_m · d_w, where S_{m,w} is the sales units for size m and week w. To estimate this, we would turn it into a log-linear model by taking logs: log S_{m,w} = log Sp_m + log d_w. It is then a matter of jointly estimating log Sp_m and log d_w by linear regression, again using only the uncensored size-week cells. However, here we run into a problem, since many of the S_{m,w} could be 0 and we cannot take the log of 0. For high-selling items whose sizes are also high-selling, we could run our linear regression after eliminating both the censored cells and the cells with S_{m,w} = 0; if sales are high enough, few cells will be 0, and this approach would be reasonable. However, far more frequently, sales of a size are low even if the item itself is a high seller, because even a high-selling item can have sizes which don’t sell that frequently. For low-selling sizes, cells with 0 are crucial to correctly determining the rate of sale of the size, and so cannot be ignored. Thus, it is essential to choose a method of estimation which can handle S_{m,w} = 0.
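To make the difficulty concrete, here is a sketch (our own illustration, not the service's implementation) of the log-linear fit on the censored example. Dropping both the censored and the zero cells, as this approach requires, leaves size S with only two of its five uncensored weeks:

```python
import numpy as np

# Censored sales table; np.nan marks censored ("x") cells. Rows: S, M, L.
S = np.array([
    [1, 0, np.nan, 0, 4, np.nan, 0],
    [8, 8, 11, 7, 6, np.nan, np.nan],
    [np.nan, 2, np.nan, 6, 6, 2, 3],
], dtype=float)

# The log-linear fit must drop censored cells AND zero-sales cells (log 0 is undefined).
usable = ~np.isnan(S) & (np.nan_to_num(S) > 0)
rows, cols = np.where(usable)
n = int(usable.sum())  # only 12 of the 16 uncensored cells survive

# One indicator column per size (for log Sp_m) and per week (for log d_w).
X = np.zeros((n, 3 + 7))
X[np.arange(n), rows] = 1.0
X[np.arange(n), 3 + cols] = 1.0
y = np.log(S[usable])

# Minimum-norm least squares absorbs the shift ambiguity between log Sp and log d.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
size_profile = np.exp(beta[:3])
size_profile /= size_profile.sum()
```

Because size S's three zero-sales weeks are silently discarded, its estimated share is inflated relative to its true rate of sale, which is exactly the bias discussed above.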
Let’s take a closer look at the difference between the maximum likelihood approach and the basic arithmetic approach when the data is censored. Suppose that in the above censored example, the weekly demands d_w are actually constant across the weeks. Then the difference between the two approaches amounts to the following. Consider calculating the average rate of sale of size L. In the basic arithmetic approach, the total, 2 + 6 + 6 + 2 + 3 = 19, would be divided by the number of weeks, 7, to get a rate of sale of 19/7 ≈ 2.7 units per week. In contrast, if we stick to the principle of using only the available evidence, we would calculate 19/5 = 3.8 units per week, where the denominator is 5 because 2 of the weeks are censored and so shouldn’t be used as evidence. In a nutshell, this is what the maximum likelihood calculation is doing. (Thus, if the d_w were actually constant, the maximum likelihood calculation could again be reduced to basic arithmetic, indicating that the varying d_w is what makes the censored problem difficult.)
Note that for size S, using the available evidence gives a rate of sale of (1 + 0 + 0 + 4 + 0)/5 = 1 per week. The 0 cells are crucial here, and we get a very different answer, 5/2, if we ignore the 0 cells, as would happen with a log-linear model.
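The three calculations (dividing by all weeks, dividing by uncensored weeks only, and additionally dropping the zero cells) can be checked directly:

```python
# Uncensored-weeks data from the example; None marks censored cells.
size_L = [None, 2, None, 6, 6, 2, 3]
size_S = [1, 0, None, 0, 4, None, 0]

def rates(weeks):
    observed = [s for s in weeks if s is not None]
    naive = sum(observed) / len(weeks)             # divide by all 7 weeks
    evidence_only = sum(observed) / len(observed)  # divide by uncensored weeks only
    return naive, evidence_only

print(rates(size_L))  # (2.714..., 3.8), i.e. 19/7 vs 19/5

# For size S, also compare against wrongly dropping the 0 cells:
obs_S = [s for s in size_S if s is not None]
print(sum(obs_S) / len(obs_S))          # 1.0, the available-evidence rate
nonzero_S = [s for s in obs_S if s > 0]
print(sum(nonzero_S) / len(nonzero_S))  # 2.5, the log-linear-style answer (5/2)
```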
Unfortunately, the approach of using only the available evidence contains a subtle problem that only becomes evident when we have sizes with a very low rate of sale. In the next installment of this blog, we will cover the problem and a possible solution. Stay tuned!