Every day we make decisions based on uncertain information.
Living in the UK, a fantastic example of this is whether you should take an umbrella with you. Weather forecasts are notoriously unreliable: they can predict a clear blue sky and it can still end up bucketing it down.
That’s why most of the time I have an umbrella in my bag, just in case.
It's a regret-minimising decision; in other words, the cost of bringing my umbrella with me and not needing it is much lower than needing it and not having it.
Here is a table outlining the possible states:

|             | Take umbrella                         | Leave it at home     |
|-------------|---------------------------------------|----------------------|
| **Rain**    | Dry, the umbrella earns its keep      | Soaked (high regret) |
| **No rain** | Carried it for nothing (small cost)   | All good             |

If we were to naively automate this system using a typical predict-then-optimise approach, there would be two steps (sketched in code below):
- Create a rain prediction model that outputs a yes or no for whether it will rain.
- Create an optimisation process (in this case just a check to see if rain is predicted) to decide if I need an umbrella or not.
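A minimal sketch of that naive pipeline, where the features and the `predict_rain` rule are hypothetical stand-ins for a real forecast model:

```python
def predict_rain(weather_features: dict) -> bool:
    """Step 1, the prediction model. A hypothetical stand-in for any
    model mapping today's weather features to a yes/no forecast."""
    return weather_features["humidity"] > 0.8

def need_umbrella(weather_features: dict) -> bool:
    """Step 2, the 'optimisation': just a check on the prediction."""
    return predict_rain(weather_features)

print(need_umbrella({"humidity": 0.85}))  # True, so pack the umbrella
```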
I'm sure you can see a number of issues and some obvious ways to improve this approach. We could do more than just predict a binary variable, $r \in \{0, 1\}$, for whether it will rain: we could predict a probability, or how heavy the rain will be. Doing so would complicate our optimisation step; perhaps now we use a threshold value.
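For example, with invented costs matching the regret framing above, the threshold falls straight out of the two costs:

```python
def decide_umbrella(p_rain: float,
                    cost_carry: float = 1.0,    # annoyance of carrying it dry
                    cost_soaked: float = 10.0,  # pain of rain with no umbrella
                    ) -> bool:
    # Skipping the umbrella costs p_rain * cost_soaked in expectation;
    # carrying it costs cost_carry regardless. Take it whenever
    # p_rain > cost_carry / cost_soaked (here, a 10% threshold).
    return p_rain * cost_soaked > cost_carry

print(decide_umbrella(0.15))  # True: a 15% chance already beats the threshold
```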
However, this doesn't address the underlying issue: what if the prediction is wrong? The optimisation process has no knowledge that it is using predictions and, more importantly, the model has learned to minimise its prediction error, $\mathcal{L}(\hat{y}, y)$, with no regard for the optimisation.
Unlike the predict-then-optimise approach, Decision-Focused Learning (DFL) teaches machine learning models to generate predictions that, when used to make a decision, minimise regret.
## The maths

### Machine learning
The typical machine learning (ML) process is defined as follows. Given a set of pairs of features and labels:

$$D = \{(x_1, y_1), \dots, (x_N, y_N)\}$$

where $x_i$ is a feature vector sampled from the feature space $\mathcal{X}$ and its values have some correlation to $y_i$. Assuming that it exists, we want to learn the true function $f$ where $y = f(x)$. However, this isn't possible, so we use the data, $D$, to learn a model $m_\theta$ where $\hat{y} = m_\theta(x)$ that minimises a loss function $\mathcal{L}(\hat{y}, y)$ between $\hat{y}$ and $y$. A typical loss function might look something like the squared error, $\mathcal{L}(\hat{y}, y) = (\hat{y} - y)^2$. For a parametrised model the learning process looks like

$$\theta^* = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(m_\theta(x_i), y_i)$$

where $\theta^*$ would be the optimal parameters for the model given the data available.
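As a toy illustration of this learning process, here is a one-parameter linear model fitted by gradient descent on the mean squared error (the data is invented):

```python
import numpy as np

# Invented toy dataset D = {(x_i, y_i)}: y is roughly 3x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + rng.normal(0.0, 0.1, size=100)

theta = 0.0  # single model parameter, y_hat = theta * x
lr = 0.1     # learning rate

for _ in range(200):
    y_hat = theta * x                      # predictions m_theta(x)
    grad = np.mean(2.0 * (y_hat - y) * x)  # gradient of the MSE w.r.t. theta
    theta -= lr * grad                     # step towards theta*

print(theta)  # close to 3.0, the parameter minimising the loss
```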
### Optimisation
In a classical optimisation process, we attempt to find the optimal solution $s^*$ to a problem. $s^*$ is typically defined as being drawn from the set of all valid solutions $\mathcal{S}$, though in some notations this set is the space of all possible solutions, valid or not, and constraints are applied during the optimisation process. We will use $\mathcal{S}$ for the valid solution space and $\mathcal{S}'$ for the larger space of all possible solutions. In maths notation,

$$\mathcal{S} = \{s \in \mathcal{S}' \mid g(s) \leq 0\}$$

where $g$ is a constraint on the solution. The optimality of a solution is measured using a fitness function $v$, which we will treat as a cost to be minimised. We can define the optimisation process as

$$s^* = \arg\min_{s \in \mathcal{S}} v(s)$$
Of note here is that $v$ can often be computationally expensive to evaluate, or $\mathcal{S}$ so large that it is infeasible to try all possible solutions. As such, methods must be used to reduce the search space to find a good solution. This is typically a combination of heuristics (often defined using domain expertise) and advanced search methods such as genetic algorithms. In fact, the machine learning problem defined above is itself an optimisation problem.
In the case of the ML problem with a parametrised model such as a neural network, it is possible to calculate gradients and use them to direct the exploration of the solution space in order to find a solution that is close to the true optimal solution, i.e. $\theta \approx \theta^*$.
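To ground the notation, here is a brute-force sketch on a toy knapsack problem (all numbers invented): $\mathcal{S}'$ is every subset of the items, the constraint $g$ keeps only subsets within the weight limit, and the fitness $v$ is the negated total value. With $n$ items there are $2^n$ candidate solutions, which is exactly why brute force quickly becomes infeasible.

```python
from itertools import combinations

# Toy knapsack: item name -> (weight, value).
items = {"a": (3, 4.0), "b": (2, 3.0), "c": (4, 5.5)}
LIMIT = 5  # weight limit

def g(s):
    """Constraint: g(s) <= 0 means the subset s is within the limit."""
    return sum(items[i][0] for i in s) - LIMIT

def v(s):
    """Fitness (a cost we minimise): the negated total value of s."""
    return -sum(items[i][1] for i in s)

# S' is every subset of the items: 2^n candidates.
all_solutions = [c for r in range(len(items) + 1)
                 for c in combinations(items, r)]
valid = [s for s in all_solutions if g(s) <= 0]  # S, the valid solutions

s_star = min(valid, key=v)  # argmin over S
print(s_star, -v(s_star))   # ('a', 'b') with total value 7.0
```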
### Predict-then-Optimise
In the predict-then-optimise (PTO) setting, the optimisation process depends on some unknown values $y$. To find the true optimal solution, the fitness function must be conditioned on the unknown values $y$. As such, the optimisation above must be reformulated to

$$s^* = \arg\min_{s \in \mathcal{S}} v(s, y)$$

In a practical sense, this is infeasible as $y$ is unknown. To circumvent this issue the model $m_\theta$ can be used, as $x$ is available. Using $\hat{y} = m_\theta(x)$ we can generate a value to be used in the optimisation:

$$\hat{s}^* = \arg\min_{s \in \mathcal{S}} v(s, \hat{y})$$

We use the notation $\hat{s}^*$ here to denote that it is the optimal solution found using predictions of $y$. If the model perfectly reconstructs the true values of $y$ then the solution found should be the true optimal solution, i.e. $\hat{s}^* = s^*$.
Note that the dimensionality of $y$ is typically high, as all the values needed to parametrise the optimisation problem are predicted. As such, $\hat{y}$ is likely composed of the outputs of multiple models,

$$\hat{y} = \left(m_{\theta_1}(x), m_{\theta_2}(x), \dots, m_{\theta_k}(x)\right)$$

where each model predicts an individual component, e.g. travel time, supply, demand, and cost.
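Putting the two stages together on a variant of the toy knapsack from above, with the item values now passed in as an argument because they are predicted (the "predictions" here are hand-written stand-ins for model outputs):

```python
from itertools import combinations

weights = {"a": 3, "b": 2, "c": 4}
true_values = {"a": 4.0, "b": 3.0, "c": 5.5}  # the unknown y
LIMIT = 5

def v(s, values):
    """Fitness (a cost we minimise): negated total value of subset s."""
    return -sum(values[i] for i in s)

def solve(values):
    """The optimisation step: best valid subset for the given values."""
    subsets = [c for r in range(len(weights) + 1)
               for c in combinations(weights, r)]
    valid = [s for s in subsets if sum(weights[i] for i in s) <= LIMIT]
    return min(valid, key=lambda s: v(s, values))

# Step 1: predict y_hat = m_theta(x); here a stand-in, slightly wrong.
pred_values = {"a": 2.5, "b": 3.1, "c": 6.0}

# Step 2: optimise as if the predictions were the truth.
s_hat = solve(pred_values)   # decision from predictions: ('c',)
s_star = solve(true_values)  # the true optimum:          ('a', 'b')
print(v(s_hat, true_values), v(s_star, true_values))  # -5.5 vs -7.0
```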
Pros:
- Simple to implement.
- Predictive models are fairly easy to learn.
- Any optimisation method can be used.

Cons:
- No information about the optimisation process is passed into the model.
- It has been shown that model misspecification can lead to extreme errors when multiple correlated variables are combined in the optimisation process [@Cameron2022].
### Decision-Focused Learning
In the PTO setting, we noted that when $\hat{y} = y$, the optimisation should result in a solution where $\hat{s}^* = s^*$. However, models learned using the ML objective above will almost always have some error. The error induced by the sub-optimal solution selected because $\hat{y} \neq y$ is called the decision loss (DL). It is the value of the fitness function of the solution generated using the predicted values, evaluated using the ground truth values of $y$:

$$\mathrm{DL}(\hat{y}, y) = v(\hat{s}^*(\hat{y}), y)$$
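Continuing the knapsack sketch, and reusing the `solve`, `v`, `pred_values`, and `true_values` defined there, the decision loss and regret could be computed as:

```python
def decision_loss(y_hat, y_true):
    """Fitness of the decision made with predictions, scored on truth."""
    return v(solve(y_hat), y_true)

def regret(y_hat, y_true):
    """How much worse that decision is than the true optimum."""
    return decision_loss(y_hat, y_true) - v(solve(y_true), y_true)

print(regret(pred_values, true_values))  # 1.5: the value left on the table
```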
In a DFL setting our goal is now reformulated. We aim to learn a model $m_\theta$ that minimises the DL. For the dataset $D$, the learning process would be:

$$\theta^* = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \mathrm{DL}(m_\theta(x_i), y_i)$$
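This objective is genuinely different from minimising prediction error. Reusing the `regret` helper above, a prediction with a much larger error can still lead to the better decision (numbers chosen to make the point):

```python
import numpy as np

def mse(y_hat, y):
    return np.mean([(y_hat[k] - y[k]) ** 2 for k in y])

close_but_wrong = {"a": 3.3, "b": 2.3, "c": 5.7}  # small errors, bad decision
far_but_right = {"a": 6.0, "b": 5.0, "c": 5.5}    # large errors, right decision

for y_hat in (close_but_wrong, far_but_right):
    print(round(mse(y_hat, true_values), 2), regret(y_hat, true_values))
# 0.34 1.5   <- low prediction error, high regret
# 2.67 0.0   <- high prediction error, zero regret
```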
Pros:
- Better decision performance: the model is trained against the loss that actually matters downstream, not raw prediction error.
Cons:
- Hard to implement.
- Can be more computationally expensive to learn.
- Still has an (expensive) optimisation step at decision time.