Optimizing Retail Discounts with Machine Learning


In this paper, we show how to apply machine learning to pricing and discounts. The goal is to create an optimal discounting strategy, one that maximizes total retail income. To do this, we use machine learning to predict sales for a given combination of discounts, and then we formulate and solve a mathematical model that combines the discounts to achieve the maximum outcome.

We also outline a reference architecture for this implementation. It is built from open source components, such as Sqoop, Kafka, Spark, Cassandra, and Big Data storage, which suit today’s store chains that deal with a large number of variables and need decisions in near-real time. We also point to an AWS-specific implementation, should one be desired.

The benefits of this paper can be summarized as follows:

  1. Sales volume prediction using machine learning
  2. Maximizing gross sales through discount optimization with advanced machine learning
  3. Software implementation architecture

Prior work

We have looked at prior work but found it wanting.

In “How Retail Stores Can Use Machine Learning to Boost Their Sales” the author rambles about “AI being the right tool for the job.” Apart from grammatical mistakes, this article contains no practical information. However, it is an example of writing that is quite common today.

In “Machine Learning for Retail” the authors explain the background of pricing and discounting, and introduce the realities of customer behavior. They also give a good practical list of types of discounts.

The same authors also list the benefits. Let’s look at them.

  1. Automating price prediction freed a team of four or five analysts, increased prediction accuracy, and extended the analysis beyond the 30 products that manual analysis allowed. Our observation: these are general benefits of automation.
  2. Planning for discounts made it possible to plan for the additional sales that the discounts generate, so inventories could keep up. Our observation: these are the general benefits of integrating sales and procurement.

However, the article stops there. In this, the article is a teaser of sorts: “Hire us, and we will do good work.” We would like to see more technical details on the models. Another question: if all the implementation did was automate linear regression, then this is not new machine learning but automation of the old kind. Later, we will show how we address these questions.

In “How to Use Machine Learning to Further Retail Analytic Capacity” the author correctly argues that machine learning will provide the following benefits:

  1. Watch the behavior of a particular customer or a group of customers (segment), analyze it, and react to it.
  2. Add other data sources, such as social media, and customize the sales to a segment or an individual customer. You may end up creating individualized discounts and coupons.

While this is still a science fiction idea, it is getting closer to becoming a reality. However, the article only paints the desired picture, without giving any details that could lead to this ideal result.

How our work is different

In this paper, we give as much detail as needed to start from scratch and implement a proper discounting system. In the spirit of the open world, Facebook has published the description of their face recognition system, Google revealed the theory of their automatic translation, and in general companies share their scientific achievements.

Why is that? The popular explanation is that “Algorithms are cheap, data and execution are what counts.” We will follow the same approach. We will explain how to collect data, how to formulate the price prediction model, and how to solve it.

We also give the software architecture, outlining the open source software components one needs to bring together to make the system work. We base it on our experience implementing such systems, and on the best practices that we teach in our training.

However, the real novelty of our work is formulating a mathematical model which ensures the best possible performance. The prior work described above was limited to this situation: given the planned discounts, predict the sales. We go further: having the sales prediction in place, we produce recommendations on which combination of discounts to apply to produce the best possible outcome, be it gross volume maximization, brand strengthening, or any other management goal.

Sales Volume Prediction

Discounts have a ripple effect and should be studied holistically. For example, a 10-day discount may affect the sales for the following few weeks.

Other things you may want to understand are

  • How strong or weak a product is
  • How much you have to give away to drive results
  • How categories differ

Factors one needs to consider are

  • Seasonality
  • Depth of the discount
  • The duration of the promotion
  • The average sales without promotion
  • The display in the circular
  • The display and shelf placement in stores
  • The type of promotion (Buy 2 Get 1 Free, immediate discount, loyalty points)
  • The type of product (soda, water, shampoo)
  • The promotion elasticity (how much customers react to promotions on a given product)
  • The competitive pressure

In the end, you may get 20-30 predictor variables to start modeling sales. Still, for all the complexities of data evaluation and collection, sales volume prediction is a standard linear regression problem.
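
As a minimal sketch of such a regression, the Python snippet below fits an ordinary least-squares model with NumPy. Everything in it is an assumption made for illustration: the data are synthetic, only three of the factors listed above are used (discount depth, promotion duration, and average sales without promotion), and the coefficients that generate the toy data are made up.

```python
import numpy as np

# Synthetic, made-up history: each row is one promotion for one product.
rng = np.random.default_rng(0)
n_rows = 200

discount_depth = rng.uniform(0.0, 0.3, n_rows)     # fraction off the price
duration_days = rng.integers(1, 15, n_rows)        # length of the promotion
baseline_sales = rng.uniform(50.0, 150.0, n_rows)  # average sales without promotion

# Hypothetical "true" relationship, used only to generate the toy data.
volume = (20.0 + 300.0 * discount_depth + 1.5 * duration_days
          + 1.0 * baseline_sales + rng.normal(0.0, 5.0, n_rows))

# Design matrix with an intercept column, then an ordinary least-squares fit.
features = np.column_stack(
    [np.ones(n_rows), discount_depth, duration_days, baseline_sales]
)
coef, *_ = np.linalg.lstsq(features, volume, rcond=None)


def predict_volume(depth, days, baseline):
    """Predicted sales volume for one planned promotion."""
    return float(coef @ np.array([1.0, depth, days, baseline]))


print("fitted coefficients:", np.round(coef, 2))
print("volume at 20% off for 10 days:", round(predict_volume(0.2, 10, 100.0), 1))
```

With real data, the feature columns would come from the 20-30 predictors above, and categorical factors such as promotion type would need to be encoded as numbers first.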

In practice, these data will come from the SQL databases already in place. The data are made available to the data scientists as batches for research, and as a continuous stream for the production phase.

Strategy Optimization

In optimizing discount strategy, we need to select an overall goal we are pursuing. One reasonable goal could be maximizing the total profit. There are other possible goals: maximizing brand exposure and capturing the market, to name a few. But the common feature in our approach is selecting optimization criteria, which may be different in your situation.

Let us say we want to optimize the overall income. Now let us formulate this mathematically. Let us say we have N products that we sell. Then the price of our products will be denoted as

P1, P2, P3, …, PN

We can also denote them as

Pn, where n = 1,2,3, …, N.

Discounts are functions of a product line and can be denoted as Dm (n).

Dm (n) => R, where R is a number.

Thus Dm (n) is discount number m as applied to product n. If no discount was applied to product n, then Dm (n) = 0. But if, for example, a 10% discount was applied, then Dm (n) = 0.1.

We will denote the total number of discounts as M; thus, our discounts are listed as

D1, D2, D3, …, DM, or Dm, where m = 1,2,3, …, M.

When we say that a discount is a function, we mean that a discount is not just a flat 10% off. It is usually a more complex rule that applies only at specific times and may, in fact, change with time.

The number of discounts that we can apply at a certain time is usually limited. Thus, M is fixed.

The number M is determined by the burden of applying discounts, by the number of people who have to deal with them, by the additional logistics and warehousing considerations, as mentioned above in the section “The Reality of Discounting.”


The products and discounts together form a matrix. This matrix shows which discount has been applied to which product line.

Dmn, where m = 1,2,3, …, M and n = 1,2,3, …, N.

If the discount m has not been applied to product line n, then Dmn = 0. Thus, our matrix is sparse, that is, it has many zero elements.
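
To make the matrix concrete, here is a small Python sketch. The sizes and discount values are made up, and the last line assumes that discounts applied to the same product line simply add up, which is one possible convention rather than something fixed by the model above.

```python
import numpy as np

M, N = 3, 6            # hypothetical: 3 discounts, 6 product lines
D = np.zeros((M, N))   # D[m, n] = 0 means discount m is not applied to product n

D[0, 1] = 0.10         # first discount: 10% off the second product line
D[1, 4] = 0.05         # second discount: 5% off the fifth product line
D[2, 1] = 0.02         # third discount: a further 2% off the second product line

applied = np.count_nonzero(D)
sparsity = 1.0 - applied / D.size   # share of zero elements

# Assumption: discounts on the same product line add up.
total_discount = D.sum(axis=0)

print("non-zero entries:", applied)
print("sparsity:", round(sparsity, 3))
print("total discount per product line:", total_discount)
```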

Let us now denote the sales volume for each product line as Vn, where n = 1,2,3, …, N. There are as many sales volumes as there are product lines, which allows us to use the same index.

The total gross income for all product lines can be calculated as
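
A form consistent with the definitions above, under the assumptions that the discounts applied to one product line add up and that V_n is the sales volume predicted for that combination of discounts, is:

```latex
G \;=\; \sum_{n=1}^{N} P_n \Bigl(1 - \sum_{m=1}^{M} D_{mn}\Bigr) V_n \;\Rightarrow\; \max
```

Here G is simply a label for the total gross income; P_n, D_{mn}, and V_n are the prices, discount matrix elements, and sales volumes defined earlier.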

When we write => max, it means that we want to maximize the total gross income from sales.

Thus, we have an optimization problem, which involves modeling sales, then finding the best possible discounts under the given limitations.


The maximization problem formulated above can be solved with gradient descent (strictly speaking, gradient ascent, since we are maximizing). Here is why. Imagine that we are dealing with a small discount X. Within certain boundaries, if we apply a discount X, we may expect the sales to increase in proportion to the discount, say, as A * X. Thus, our total gross can be calculated as

P * (1 – X) * V * (1 + AX)

Here P is price, X is a discount, V is the volume of sales, and A is a certain coefficient.
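
Expanding this expression shows how the gross depends on the discount. The symbol G and the optimum X* are labels introduced here for illustration only:

```latex
\begin{aligned}
G(X) &= P\,(1 - X)\,V\,(1 + AX) = P V \bigl[\,1 + (A - 1)X - A X^{2}\,\bigr], \\
\frac{dG}{dX} &= P V \bigl[(A - 1) - 2AX\bigr], \qquad
\frac{d^{2}G}{dX^{2}} = -2APV, \\
\frac{dG}{dX} &= 0 \;\;\Longrightarrow\;\; X^{*} = \frac{A - 1}{2A}.
\end{aligned}
```

The second derivative is negative whenever A > 0, so the stationary point is a maximum. It corresponds to a positive discount only when A > 1, that is, when the volume responds more than proportionally to the discount. For example, with A = 3 the formula gives X* = 1/3.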

Granted, this equation holds true only for small values of discounts. For example, we already gave a 10% discount and are now considering another 0.5%. But that is exactly the situation we are interested in: in gradient descent, you vary the parameters just slightly and observe how the solution changes. This small step is called “learning step,” and it is one of the parameters you set for the algorithm.

You see then that the curve for gross income is locally a parabola, a quadratic function of the discount X. For A > 0 this parabola opens downward, so it is a concave function with a single maximum, and gradient methods work well for such functions. We can, therefore, formulate the algorithm.

  1. For a given combination of discounts, calculate expected sales volumes using linear regression.
  2. Calculate the sales volume derivatives. Using a gradient descent optimization step with a learning step S, adjust the discounts toward the maximum gross income.
  3. For the new combination of discounts, repeat from step 1.

As you are doing these iterations, add conditions which will verify that the algorithm converges. If it starts to diverge, increase the regularization parameter for the linear regression. Stop calculations when you don’t get significant improvements in the gross sales.
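
The loop above can be sketched in Python as follows. This is an illustration under stated assumptions, not a production implementation: the prices, base volumes, and response coefficients are made up; a simple linear response stands in for the fitted regression model; each product carries a single discount; the gradient is estimated numerically; and discounts are clipped to the range from 0 to 50%. The update is written as gradient ascent, which is the same as gradient descent on the negative of the gross income.

```python
import numpy as np

# Made-up inputs for three product lines.
prices = np.array([10.0, 4.0, 25.0])           # P_n
base_volume = np.array([100.0, 300.0, 40.0])   # V_n without any discount
response = np.array([3.0, 0.5, 2.0])           # A_n, response to discounts


def predict_volume(discounts):
    """Stand-in for the fitted regression: V_n * (1 + A_n * X_n)."""
    return base_volume * (1.0 + response * discounts)


def gross_income(discounts):
    """Total gross income for a given vector of discounts."""
    return float(np.sum(prices * (1.0 - discounts) * predict_volume(discounts)))


def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return grad


def optimize_discounts(start, learning_step=1e-4, tol=1e-9,
                       max_iter=100_000, max_discount=0.5):
    """Gradient ascent on gross income; stops when the gain drops below tol."""
    x = start.copy()
    best = gross_income(x)
    for _ in range(max_iter):
        grad = numerical_gradient(gross_income, x)
        candidate = np.clip(x + learning_step * grad, 0.0, max_discount)
        value = gross_income(candidate)
        if value - best < tol:   # no significant improvement: stop
            break
        x, best = candidate, value
    return x, best


discounts, income = optimize_discounts(np.zeros(3))
print("recommended discounts:", np.round(discounts, 3))
print("gross income:", round(income, 2))
```

In this toy setup the second product responds weakly to discounts (its coefficient is below 1), so the loop leaves it undiscounted. The learning step has to be small relative to the curvature of the gross income, otherwise the iterations overshoot and diverge.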

Implementation Architecture

The algorithm we described can be implemented in any machine learning environment, such as R or Python. For large volumes of data, it may be advisable to run the calculations on a Spark cluster.

The algorithm is a modified version of gradient descent, so anybody familiar with the internal implementation of gradient descent in, for example, Spark MLlib, will be able to derive their own solution from the code in that library.

Once the model is trained, the discounts can be offered for a day or a few days before the model needs to be retrained. Since the data will likely be stored in SQL databases, the following architecture can be suggested

If an Amazon AWS-based implementation is desired, the suggested architecture will look as follows

Concluding Advice

We have shown a novel way of optimizing discounts to maximize gross sales. The recommended approach for such enhancements usually includes quick prototyping, on the order of two to four weeks, to verify the model on a small scale and to prove that it is working.

The best group for such an implementation consists of one or two data scientists who can also code and implement algorithms, and one or two data analysts who understand the realities of products, pricing, and discounts.

Once this is worked out, add automation. Then expand the project, adding integration with procurement, warehouse management, and overall planning. Such approaches have worked well for many, and they will surely work for you as well.

Practical Steps

Education plays a key role. Take the Machine Learning course from Elephant Scale. We also have an upcoming “Machine Learning for Retailers.”

You will be able to implement not just this system; you will creatively invent new great uses of Machine Learning at your organization. We can also guide you and help you implement this.

Image: The Money-Lender and his Wife, by Quentin Massys (Detail)
