As fashion retailers look to reduce inventory levels and avoid excessive markdowns, forecasting sales precisely at the style and size level becomes critically important. However, given the speed at which retailers must make informed decisions, accurately forecasting sales at this level of granularity is difficult without applying data science techniques. One possible approach is to forecast at the item level (either style or color), and then spread the forecast down to the item’s sizes. The spread-down, however, requires predicting the fraction of sales that will go to each size. This set of fractions is called the size profile of the item, and estimating the profile from the item’s historical sales is the size profile problem.

While the distribution of sizes in the population changes only very slowly over time, size profiles at a retailer can change much more quickly, because they depend on the mix of consumers attracted to the retailer and to particular classes of merchandise at the retailer. Changes in a class’s merchandise may attract consumers with a different mix of sizes, and at many retailers such merchandise changes can happen quarterly. For this reason, frequent reestimation of size profiles is a necessity, and such retailers are good candidates for using data science-based software to perform the reestimation. This blog post gives examples of how Oracle Retail’s Size Profile science service, available as part of Oracle Retail’s Science Platform Cloud Service (April 2019), uses some classic machine learning techniques to estimate size profiles from a retailer’s historical sales data.

A note on the scope of the problem we will consider in this post: clothing customers may well purchase a nearby size if their preferred size is out of stock, possibly inflating sales of those nearby sizes. Retailers using Oracle Retail’s Size Profile science service assume that such substitution effects are small, and so are not worth the trouble to account for. (The substitution effects may be even smaller once returns are accounted for, since customers who do not purchase their preferred size are much more likely to return their purchases.) In addition, while such transference could be modeled with cross-effect parameters, cross effects are notoriously difficult to estimate. For all these reasons, we omit the modeling of cross effects and will not discuss them further in this blog post.

An example from summer apparel

Suppose in a retailer’s historical data, an item sold for 7 weeks at one store during the previous summer season, with the following sales units for each size:

 

|        | Week 1 | Week 2 | Week 3 | Week 4 | Week 5 | Week 6 | Week 7 | Total |
|--------|--------|--------|--------|--------|--------|--------|--------|-------|
| Size S | 1      | 0      | 3      | 0      | 4      | 3      | 0      | 11    |
| Size M | 8      | 8      | 11     | 7      | 6      | 3      | 3      | 46    |
| Size L | 4      | 2      | 7      | 6      | 6      | 2      | 3      | 32    |
| Total  |        |        |        |        |        |        |        | 89    |


For the upcoming summer season, we want to estimate the size profile for the item at this store based on its history. In this case, we need no data science more sophisticated than basic arithmetic, and our prediction for the size profile is simply (11/89, 46/89, 32/89). However, sales history is never so neat and simple, and typically we face the problem of “censored demand,” meaning we don’t have reliable information for some of the cells, as in this version of the table:

 

|        | Week 1 | Week 2 | Week 3 | Week 4 | Week 5 | Week 6 | Week 7 | Total |
|--------|--------|--------|--------|--------|--------|--------|--------|-------|
| Size S | 1      | 0      | x      | 0      | 4      | x      | 0      | ?     |
| Size M | 8      | 8      | 11     | 7      | 6      | x      | x      | ?     |
| Size L | x      | 2      | x      | 6      | 6      | 2      | 3      | ?     |
| Total  |        |        |        |        |        |        |        | ?     |

An “x” denotes a week where the size was not fully available for sale, because its inventory reached 0 during the week; as a result, we do not know what the size’s true demand would have been that week. (Reliable data for a size-week combination may also be unavailable for other reasons.) In any event, we can no longer simply sum up the sales of each size, particularly if many cells contain x’s. It is no longer a basic arithmetic problem.

One solution here is to model the demand in each cell as Poisson, and then apply maximum likelihood to estimate the size profile. In this approach, we introduce three variables, Sp_S, Sp_M, and Sp_L, for the size profile, and seven variables, d_1 through d_7, for the total demand of the item during each of the weeks above. The Poisson rate in the cell for size m and week w is then Sp_m · d_w, and maximizing the likelihood function based on the historical data above yields estimates for these variables. Choosing the Poisson distribution makes this maximization quite fast, an important consideration when the task is to determine size profiles for the hundreds of thousands of items a fashion retailer might have.

Note that the likelihood function does not include factors for the censored cells, and thus the estimation is based entirely on the cells with non-censored demand. The estimation uses only the available evidence, and the censored cells are excluded because they provide no evidence. (In the case where, for example, the inventory went to 0 during the week instead of at the beginning of the week, we may have partial evidence for the demand in the cell, but for this discussion we will ignore techniques to handle such partial evidence.)
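To make this concrete, here is a minimal sketch of the maximum likelihood calculation in Python, using the censored table above. This is an illustration only, not the service’s implementation: it hands the likelihood to a general-purpose scipy optimizer, works in log space to keep Sp_m and d_w positive, and resolves the scale ambiguity between Sp and d (scaling all Sp_m up and all d_w down by the same factor leaves the likelihood unchanged) by normalizing the profile to sum to 1 at the end.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

# Sales from the censored table: rows are sizes S, M, L; columns are
# weeks 1-7; np.nan marks the censored ("x") cells.
sales = np.array([
    [1,      0, np.nan, 0, 4, np.nan, 0     ],  # Size S
    [8,      8, 11,     7, 6, np.nan, np.nan],  # Size M
    [np.nan, 2, np.nan, 6, 6, 2,      3     ],  # Size L
])
observed = ~np.isnan(sales)
n_sizes, n_weeks = sales.shape

def neg_log_likelihood(params):
    # Work in log space so that Sp_m and d_w stay positive.
    log_sp, log_d = params[:n_sizes], params[n_sizes:]
    log_rate = log_sp[:, None] + log_d[None, :]  # log(Sp_m * d_w) per cell
    # Poisson log-likelihood, summed over the uncensored cells only:
    # the censored cells contribute no factors, i.e. no evidence.
    s = sales[observed]
    ll = s * log_rate[observed] - np.exp(log_rate[observed]) - gammaln(s + 1)
    return -ll.sum()

result = minimize(neg_log_likelihood, np.zeros(n_sizes + n_weeks),
                  method="L-BFGS-B")

# Keep only the size profile: normalize to sum to 1 and discard the d_w.
sp = np.exp(result.x[:n_sizes])
size_profile = sp / sp.sum()
print(dict(zip(["S", "M", "L"], size_profile.round(3))))
```

Because the censored cells are simply left out of the sum, the optimizer sees only the available evidence, exactly as described above.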

After the estimation, we throw away the d_w estimates and retain only the size profile estimates. However, the d_w are essential during the estimation, because the total demand for the item can vary significantly across the weeks due to time-related effects or actions by the retailer such as promotions or markdowns. Such changes in demand affect each size equally, so it is unnecessary to have a demand variable per size-week combination; a single d_w for the entire week is sufficient. For example, retailers do not change prices or run promotions on individual sizes.

In practice, we group similar items together and obtain a single size profile for the entire group. In the estimation, this means a single set of Sp variables across the entire group, but the d_w variables are specific to each item, because demand effects such as promotions might not be the same for all items.

Instead of the Poisson model of demand, suppose we use the more usual multiplicative model, which in this case would simply be S_{m,w} = Sp_m · d_w, where S_{m,w} is the sales units for size m and week w. To estimate this, we would turn it into a log-linear model by taking logs: log S_{m,w} = log Sp_m + log d_w. It is now a matter of jointly estimating log Sp_m and log d_w by linear regression, again using only the uncensored size-week cells. However, here we run into a problem: many of the S_{m,w} could be 0, and we cannot take the log of 0. For high-selling items whose sizes are also high-selling, we could run our linear regression after eliminating both the censored cells and the cells with S_{m,w} = 0; if sales are high enough, few S_{m,w} will be 0, and this approach would be reasonable. Far more frequently, however, sales of a size are low even if the item itself is a high seller, because even a high-selling item can have sizes that don’t sell frequently. For low-selling sizes, the cells with 0 are crucial to correctly determining the size’s rate of sale, and so cannot be ignored. Thus, it is essential to choose a method of estimation that can handle S_{m,w} = 0.
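For comparison, here is a sketch of the log-linear fit, continuing from the Poisson sketch above (and again only an illustration). Note the extra filter: on top of the censored cells, it must also drop every cell with 0 sales, and for a low-selling size like size S those dropped zeros carry most of the evidence.

```python
# Continues from the previous sketch. The log-linear model
# log S_{m,w} = log Sp_m + log d_w can only be fit on cells that are
# both uncensored and nonzero, because log(0) is undefined.
usable = observed & (np.nan_to_num(sales) > 0)
rows, cols = np.nonzero(usable)
y = np.log(sales[usable])

# Design matrix: one indicator column per size (for log Sp_m) and one
# per week (for log d_w).
X = np.zeros((len(y), n_sizes + n_weeks))
X[np.arange(len(y)), rows] = 1.0
X[np.arange(len(y)), n_sizes + cols] = 1.0

# lstsq returns a minimum-norm solution, which resolves the same
# Sp-versus-d scale ambiguity seen above; we renormalize anyway.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
sp_ll = np.exp(coef[:n_sizes])
print(sp_ll / sp_ll.sum())  # typically overstates size S: its 0-sales weeks were dropped
```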

Let’s take a closer look at the difference between the maximum likelihood approach and the basic arithmetic approach when the data is censored. Suppose that in the censored example above, the weekly demands d_w are actually constant across the weeks. Then the difference between the two approaches amounts to the following. Consider calculating the average rate of sale of size L. In the basic arithmetic approach, the total, 2 + 6 + 6 + 2 + 3 = 19, would be divided by the number of weeks, 7, to get a rate of sale of 19/7 ≈ 2.7 units per week. In contrast, if we stick to the principle of using only the available evidence, we would calculate 19/5 = 3.8 units per week, where the denominator is 5 because 2 of the weeks are censored and so shouldn’t be used as evidence. In a nutshell, this is what the maximum likelihood calculation is doing. (Thus, if the d_w were actually constant, the maximum likelihood calculation could again be reduced to basic arithmetic, which indicates that it is the varying d_w that makes the censored problem difficult.)

Note that for size S, using the available evidence gives a rate of sale of (1 + 0 + 0 + 4 + 0)/5 = 1 per week. The 0 cells are crucial here, and we get a very different answer, 5/2, if we ignore the 0 cells, as would happen with a log-linear model.
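The arithmetic in the last two paragraphs can be checked in a few lines of numpy. This is just the constant-d_w shortcut, not the full maximum likelihood calculation:

```python
import numpy as np

size_L = np.array([np.nan, 2, np.nan, 6, 6, 2, 3])  # censored "x" cells as np.nan
size_S = np.array([1, 0, np.nan, 0, 4, np.nan, 0])

print(np.nansum(size_L) / size_L.size)  # 19/7 ~ 2.7: censored weeks wrongly counted
print(np.nanmean(size_L))               # 19/5 = 3.8: available evidence only

obs_S = size_S[~np.isnan(size_S)]
print(obs_S.mean())                     # 5/5 = 1.0: zero-sales weeks kept
print(obs_S[obs_S > 0].mean())          # 5/2 = 2.5: zeros dropped, as in log-linear
```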

Unfortunately, the approach of using only the available evidence contains a subtle problem that only becomes evident when we have sizes with a very low rate of sale. In the next installment of this blog, we will cover the problem and a possible solution. Stay tuned!

Author: Su-Ming Wu

Su-Ming is a Senior Principal Data Scientist in Oracle’s Retail Global Business Unit. He has a Ph.D. in mathematics and an MS in computer science, and has been working on retail-science problems for 7 years.