# sPlot: a technique to reconstruct the components of a mixture

This post is devoted to explaining what an **sPlot** is. This well-known method was recently added to the hep_ml library.

An **sPlot** is a way to reconstruct the features of mixture components based on known properties of their distributions. This method is frequently used in High Energy Physics.

## Simple example of sPlot

Let’s first start with a simple (and not very useful in practice) example.

Assume we have two types of particles (say, electrons and positrons).

The distribution of some characteristic is different for them (let this be the $p_x$ momentum projection).

## Observed distributions

The picture above shows what this distribution should look like, but due to inaccuracies during classification we will observe a different picture.

Let’s assume that with a probability of 80% a particle is classified correctly (and that we are not using $p_x$ during classification).

When we look at the distributions of $p_x$ for particles classified as electrons or positrons, we see that they are distorted: we have lost the original shapes of the distributions.

## Applying sWeights

We can think of it in the following way: there are 2 bins. In the first bin, 80% are electrons and 20% are positrons, and vice versa in the second bin.

To reconstruct the initial distribution, we can plot a histogram where each event from the first bin has weight 0.8 and each event from the second bin has weight -0.2. These numbers are called sWeights.

So, let’s say we had 8000 $e^{-}$ + 2000 $e^{+}$ in the first bin and 8000 $e^{+}$ + 2000 $e^{-}$ in the second ($e^-, e^+$ denote electrons and positrons). After summing with the introduced sWeights:
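Explicitly, the weighted sums are:

$ e^-: \; 0.8 \cdot 8000 - 0.2 \cdot 2000 = 6400 - 400 = 6000 $

$ e^+: \; 0.8 \cdot 2000 - 0.2 \cdot 8000 = 1600 - 1600 = 0 $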

Positrons with positive and negative weights compensate each other, and we are left with pure electrons.

For the moment we ignore the normalization of the sWeights (it doesn’t play any role when we only want to reconstruct the shape).

### Compare

Let’s compare the reconstructed distribution for electrons with the original:
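The original figures are not reproduced here, so below is a minimal, self-contained numpy sketch of this toy example. The gaussian shapes and parameters for $p_x$ are illustrative assumptions, not the distributions from the original plots.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
# Illustrative p_x distributions (gaussians with opposite means)
px_el = np.random.normal(-1.0, 1.0, size=10000)
px_pos = np.random.normal(+1.0, 1.0, size=10000)

# 80% correct classification: bin 1 = "classified as electron",
# bin 2 = "classified as positron"
bin1 = np.concatenate([px_el[:8000], px_pos[:2000]])
bin2 = np.concatenate([px_el[8000:], px_pos[2000:]])

# sWeights: +0.8 for events in bin 1, -0.2 for events in bin 2
data = np.concatenate([bin1, bin2])
weights = np.concatenate([np.full(len(bin1), 0.8), np.full(len(bin2), -0.2)])

# The sWeighted histogram reproduces the electron shape
# (up to an overall factor of 0.6, since we ignore normalization)
plt.hist(data, bins=50, weights=weights, alpha=0.5, label='reconstructed electrons')
plt.hist(px_el, bins=50, weights=np.full(10000, 0.6), histtype='step', label='true electrons (scaled)')
plt.legend()
plt.show()
```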

# More complex case

The case when we have only two ‘bins’ is simple and straightforward. But when there are more than two bins, the solution is not unique: there are many appropriate combinations of sWeights, so which one should we choose?

And things are more complex in practice: we don’t have bins, but continuous distributions (which can be treated as many small bins).

Typically this is a distribution over mass. By fitting the mass we are able to split the mixture into two parts: the signal channel and everything else.

# Building sPlot over mass

Let’s show how this works. First, we generate two fake distributions (signal and background), each with two variables: mass and momentum.
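A minimal sketch of such a generation step; the concrete shapes and parameters below (a gaussian mass peak for signal, an exponential mass tail for background) are illustrative assumptions, chosen to match the description later in this post.

```python
import numpy as np

np.random.seed(42)
n_sig, n_bck = 10000, 10000

# Signal: gaussian peak in mass; background: exponential tail.
# Within each class, momentum is generated independently of mass -
# the requirement sPlot will rely on later.
mass_sig = np.random.normal(4.0, 0.2, size=n_sig)
mass_bck = 3.0 + np.random.exponential(2.0, size=n_bck)
momentum_sig = np.random.normal(5.0, 1.0, size=n_sig)
momentum_bck = np.random.exponential(2.0, size=n_bck)

# The observed sample is the mixture; true labels are kept only for checks
mass = np.concatenate([mass_sig, mass_bck])
momentum = np.concatenate([momentum_sig, momentum_bck])
labels = np.concatenate([np.ones(n_sig), np.zeros(n_bck)])
```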

### Of course, we don’t have labels telling us which events are signal and which are background

And we observe the mixture of two distributions:

### We have no information about real labels

But we know a priori that the background is distributed as an exponential distribution and the signal as a gaussian (more complex models can be met in practice, but the idea is the same).

After fitting the mixture (let me skip the details of this process), we get the fitted fraction and shapes of both components.
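Since the fitting itself is skipped in the text, here is one possible sketch of that step: an unbinned maximum-likelihood fit of the gaussian-plus-exponential mixture. The starting values, bounds, and the fixed offset of 3.0 follow the toy generation sketch above and are assumptions of this sketch.

```python
import numpy as np
from scipy import optimize, stats

def neg_log_likelihood(params):
    # Mixture density: f * N(mu, sigma) + (1 - f) * Expon(loc=3, scale=tau)
    f, mu, sigma, tau = params
    pdf = f * stats.norm.pdf(mass, mu, sigma) \
        + (1 - f) * stats.expon.pdf(mass, loc=3.0, scale=tau)
    return -np.sum(np.log(pdf + 1e-300))

result = optimize.minimize(neg_log_likelihood, x0=[0.5, 3.9, 0.3, 1.5],
                           bounds=[(0.01, 0.99), (3.5, 4.5), (0.05, 1.0), (0.5, 10.0)])
f_fit, mu_fit, sigma_fit, tau_fit = result.x
```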

## Fitting doesn’t give us information about real labels

But it gives information about probabilities, which allows us to estimate the number of signal and background events within each bin.

We won’t use bins, though; instead, we will obtain for each event the probability that it is signal or background:
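Using the fitted parameters from the sketch above, the per-event probabilities are just the normalized contributions of each component to the mixture density:

```python
# Per-event probabilities to be signal / background, from the mass alone
sig_density = f_fit * stats.norm.pdf(mass, mu_fit, sigma_fit)
bck_density = (1 - f_fit) * stats.expon.pdf(mass, loc=3.0, scale=tau_fit)
p_sig_event = sig_density / (sig_density + bck_density)
p_bck_event = 1 - p_sig_event
```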

## Applying sPlot

sPlot converts these probabilities into sWeights; below we use the implementation from `hep_ml`:
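A sketch of this step with `hep_ml.splot.compute_sweights`, which takes a DataFrame of per-class probabilities (summing to one for each event) and returns per-class sWeights; the column names `sig`/`bck` are just the ones chosen here:

```python
import pandas
from hep_ml import splot

# Probabilities must sum to one for each event
probs = pandas.DataFrame(dict(sig=p_sig_event, bck=p_bck_event))
sWeights = splot.compute_sweights(probs)
sWeights.head()
```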

As you can see, some sWeights are negative; they are needed to compensate the contributions of the other class.

## Using sWeights to reconstruct initial distribution

Let’s check that we have achieved our goal and can reconstruct the momentum distributions for signal and background:
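For instance, comparing the sWeighted momentum histogram against the true signal shape (the true labels are used only for this check):

```python
import matplotlib.pyplot as plt

# The sWeighted histogram of momentum should match the true signal component
plt.hist(momentum, bins=50, weights=sWeights.sig, alpha=0.5, label='signal (sWeighted)')
plt.hist(momentum[labels == 1], bins=50, histtype='step', label='signal (true)')
plt.legend()
plt.show()
```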

## Important requirement of sPlot

The reconstructed variable (here, the momentum $p$) and the splotted variable (here, the mass) shall be statistically independent within each class.

Read the line above again: the reconstructed and splotted variables must not be correlated. As a demonstration of why this is important, let’s use the sWeights to reconstruct the mass itself (the mass is obviously correlated with the mass):
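A sketch of this check, reusing the sWeights computed above:

```python
# The sWeights are themselves functions of the mass, so sWeighting the mass
# violates the independence requirement: the weighted histogram does not
# reproduce the true signal shape (note the negative dips in the sidebands)
plt.hist(mass, bins=50, weights=sWeights.sig, alpha=0.5, label='mass (sWeighted as signal)')
plt.hist(mass[labels == 1], bins=50, histtype='step', label='signal mass (true)')
plt.legend()
plt.show()
```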

$\def\ps{p_s(x)}$ $\def\pb{p_b(x)}$ $\def\ws{sw_s(x)}$ $\def\wb{sw_b(x)}$

# Derivation of sWeights

Now that we have seen how this works, let’s derive the formula for the sWeights.

The only information we have from the fit over mass is $ \ps $, $ \pb $: the probabilities for event $x$ to be signal or background.

Our main goal is to reconstruct the histogram correctly. Let’s reconstruct the number of *signal* events in a *particular* bin.
Let’s introduce the unknowns $p_s$ and $p_b$: the probabilities that a signal (respectively, background) event falls into this bin.

(Since the mass and the reconstructed variable are statistically independent within each class, $p_s$ and $p_b$ do not depend on the mass.)

The mathematical expectation of the number of signal events in the bin is obviously equal to $p_s N_s$, where $N_s$ is the total number of signal events, available from the fit.

Let’s also introduce random variable $1_{x \in bin}$, which is 1 iff event $x$ lies in selected bin.

The **estimate for the number of signal events in the bin** is equal to:

$ X = \sum_x \ws \; 1_{x \in bin}, $

where $\ws$ are the sPlot weights, which are yet to be found.

## First main property of sWeights

**Property 1.** We expect our estimate to be unbiased: $\mathbb{E} \, X = p_s N_s$.

**Corollary.**
Let’s see what this means for the sPlot weights.

$ p_s N_s = \mathbb{E} \, X = \sum_x \ws \; \mathbb{E} \, 1_{x \in bin} = \sum_x \ws \; (p_s \ps + p_b \pb) $

In the line above I used the assumption that variables are statistically independent for each class.

Since the previous equation should hold for all possible $p_s$ and $p_b$, we get two equalities:

$ p_s N_s = \sum_x \ws \; p_s \ps $

$ 0 = \sum_x \ws \; p_b \pb $

After reduction:

$ N_s = \sum_x \ws \; \ps $

$ 0 = \sum_x \ws \; \pb $

This way we guarantee that the mean contribution of the background is zero (the expectation is zero, though the observed contribution deviates from zero due to statistical fluctuations), and that the expected number of reconstructed signal events is correct.

## Under the assumption of linearity

Assuming that the sPlot weight can be computed as a linear combination of the conditional probabilities:

$ \ws = a_1 \pb + a_2 \ps$

We can easily find these coefficients. First, let’s rewrite our system:

$ \sum_x (a_1 \pb + a_2 \ps) \; \pb = 0$

$ \sum_x (a_1 \pb + a_2 \ps) \; \ps = N_{sig}$

$ a_1 V_{bb} + a_2 V_{bs} = 0$

$ a_1 V_{sb} + a_2 V_{ss} = N_{sig}$

where $V_{ss} = \sum_x \ps \; \ps $, $V_{bs} = V_{sb} = \sum_x \ps \; \pb$, $V_{bb} = \sum_x \pb \; \pb$.

Having solved this linear system, we get the needed coefficients (the same as those in the paper).

NB. There is a slight difference between the $V$ matrix I use and the $V$ matrix in the paper.
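To make the derivation concrete, here is a from-scratch sketch using the per-event probabilities computed earlier; estimating $N_{sig}$ as the sum of signal probabilities is an assumption of this sketch.

```python
import numpy as np

# Build the V matrix from the fitted per-event probabilities
V_ss = np.sum(p_sig_event * p_sig_event)
V_sb = np.sum(p_sig_event * p_bck_event)
V_bb = np.sum(p_bck_event * p_bck_event)
N_sig = np.sum(p_sig_event)  # estimate of the total signal yield

# Solve: a1 * V_bb + a2 * V_sb = 0,  a1 * V_sb + a2 * V_ss = N_sig
a1, a2 = np.linalg.solve([[V_bb, V_sb],
                          [V_sb, V_ss]], [0.0, N_sig])
sweights_signal = a1 * p_bck_event + a2 * p_sig_event
```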

## Minimization of variation

$\def\Var{\mathbb{V}\,}$

The previous part allows one to get the correct result, but it still doesn’t explain the reason for assuming linearity.

Apart from having the correct mean, we should also minimize the variance of any reconstructed variable. Let’s write it down (using the independence of events):

$ \Var X = \sum_x \ws^2 \; \Var 1_{x \in bin} = \sum_x \ws^2 \; (p_s \ps + p_b \pb)(1 - p_s \ps - p_b \pb) $

A bit complex, isn’t it? Instead of optimizing such a complex expression (which is individual for each bin), let’s minimize its **uniform upper estimate**:

$ \Var X \leq \sum_x \ws^2 $

So, if we are going to minimize this upper estimate, we should solve the following optimization problem with constraints:

$\sum_x \ws^2 \to \min $

$\sum_x \ws \; \pb = 0$

$\sum_x \ws \; \ps = N_{sig}$

Let’s write the Lagrangian of the optimization problem:

$ \mathcal{L} = \sum_x \ws^2 + \lambda_1 \sum_x \ws \; \pb + \lambda_2 \left( \sum_x \ws \; \ps - N_{sig} \right) $

After taking the derivative with respect to $ \ws $ we get the equality

$ 2 \ws + \lambda_1 \pb + \lambda_2 \ps = 0, $

which holds for every $x$. Thus, after renaming $a_1 = - \lambda_1 / 2$, $a_2 = - \lambda_2 / 2$ for convenience, we get the needed linear dependency $ \ws = a_1 \pb + a_2 \ps $.

### Uncorrelatedness

The main assumption we used here is that, within each class, the distribution of the reconstructed variable is exactly the same in every bin.

In other words, we stated that there is no correlation between the index of the bin and the reconstructed variable. Remembering that each bin corresponds to some interval in mass, we finally get:

**the reconstructed variable must not be correlated with the mass (or any other splotted variable)**

# Conclusion

- sPlot allows reconstructing the distributions of variables for the individual components of a mixture.
- the only information used is the probabilities taken from the fit over a separate variable. In fact, any probability estimates fit well.
- the source of probabilities should be statistically independent from the reconstructed variable (for each class!).
- the mixture may contain more than 2 classes (this is supported by `hep_ml.splot` as well).

## Sources and code

The code for this post can be found in the `hep_ml` repository.

## Links

A very close explanation was written by Michael Schmelling.