Whenever possible, substitute constructions out of known entities for inferences to unknown entities
Occam's razor (Russell's version)

If Russell was studying Machine Learning our days, he’d probably throw out all of the textbooks.

The reason is that ML introduces too many terms with subtle or no difference. It’s hard understand the scale of the problem without a good example.

You can ask your ML-experienced friends how many of the entities are there in this list of terms?

1. mean squared error (MSE)
2. logistic loss
3. cross-entropy
4. Kullback-Leibler (KL) divergence
5. logarithmic scoring rule
7. Brier scoring rule
8. Gini splitting criterion (Gini impurity) for decision tree (not Gini coefficient)
9. entropy splitting criterion for decision tree
10. binomial / multinomial deviance
11. ordinary least squares
12. mutual information
13. logarithmic loss
14. information gain
15. softmax classifier

Not to name some trivial things like sum of squared errors is the same entity as MSE.

This is enough to say that terminology used in ML today is very unfriendly and requires some minimization.

Group 1 contains different synonyms of MSE:

• mean squared error (MSE)
• Brier scoring rule
• Gini splitting criterion (Gini impurity)
• ordinary least squares

It is not obvious that Gini is a particular version of MSE splitting, but I demonstrate this below.

Group 2 contains different versions of cross-entropy:

1. logistic loss
2. cross-entropy
3. Kullback-Leibler (KL) divergence
4. logarithmic scoring rule
5. entropy splitting criterion for decision tree
6. binomial / multinomial deviance
7. mutual information
8. logarithmic loss
9. information gain
10. softmax classifier

Entropy is cross-entropy of distribution with itself. KL-divergence is difference of cross-entropy and entropy.

There are many other ways to define the same concept in particular cases: softmax classifier corresponds to optimization of log-likelihood for Bernoulli distribution (in case of two classes) or categorical distribution (in case of more than two classes).

Non-trivial thing here: entropy splitting criterion is the same as logloss minimization.

## Why Gini splitting = MSE minimization?

Simplest case is binary classification: $y_i = 0 \text{ or } 1$. Let’s first revisit MSE splitting criterion.

$$\text{MSE} = \sum_{i} (y_i - \hat{y}_i)^2$$

Suppose we split the feature space into leaves of a tree $\text{leaf} = 1, 2, . . . , L$.

We are going to group summands according to the leaf each observation belongs to. Penalty for a single leaf depends only on the number of samples $n_\text{leaf, 0}, n_\text{leaf, 1}$ of classes 0 and 1:

$$\text{MSE}_\text{leaf} = \sum_{i \in \text{leaf}} (y_i - \hat{y})^2 = n_{\text{leaf}, 0} \times (0-\hat{y})^2 + n_{\text{leaf}, 1} \times (1 - \hat{y})^2 =$$ $$= \{ \text{select optimal } \hat{y} \} = n_{leaf, 0} \left[ \dfrac{n_{leaf, 0}}{n_{leaf, 0} + n_{leaf, 1}} \right]^2 + n_{leaf, 1} \left[ \dfrac{n_{leaf, 1}}{n_{leaf, 0} + n_{leaf, 1}} \right]^2 =$$ $$= \dfrac{n_{leaf, 0} \times n_{leaf, 1}}{n_{leaf, 0} + n_{leaf, 1}}$$

So, while building decision tree, we greedily optimized the following global function (but each split optimizes only two terms in the sum):

$$\text{MSE} = \sum_{\text{leaf}=1}^{L} \dfrac{n_{\text{leaf}, 0} \times n_{\text{leaf}, 1}}{n_{\text{leaf}, 0} + n_{\text{leaf}, 1}}$$

Let’s also introduce $n_\text{leaf} = n_{\text{leaf}, 0} + n_{\text{leaf}, 1}$ – number of all samples in the leaf and portions of class 1 in leaf: $p_{\text{leaf}, 1} = \dfrac{n_{\text{leaf}, 1}}{n_{\text{leaf}}}$ . Using the new terms we have:

$$\text{MSE} = \sum_{\text{leaf}=1}^{L} n_\text{leaf} \times p_{\text{leaf}, 1} (1 - p_{\text{leaf}, 1}) = \text{Gini impurity}$$

… which is nothing but Gini impurity summed over leaves. It is easy to notice that in the case of classification with multiple classes we need to first perform one-hot encoding and then minimize MSE to get the same splitting rule as provided by Gini.

## Why Entropy splitting = LogLoss minimization?

Let’s start from writing down the LogLoss for the decision tree:

$$\text{LogLoss} = \sum_{i} \sum_{c} y_{i,c} \log p_{i,c}$$

(where $y_{i,c}$ is 1 if event $i$ belongs to class $c$ and 0 otherwise; $p_{i,c}$ is a probability for $t$th sample to belong to class $c$, predicted by a tree).

When selecting optimal predictions in a leaf, we optimize:

$$\text{LogLoss}_\text{leaf} = \sum_{i \in \text{leaf}} \sum_{c} y_{i,c} \log p_{\text{leaf}, c}$$

Easy to prove that an optimal point is $p_{\text{leaf}, c} = \dfrac{n_{\text{leaf}, c}}{n_{\text{leaf}}}$

$$\text{LogLoss}_\text{leaf} = \sum_{c} n_{\text{leaf}, c} \log p_{\text{leaf}, c} = n_{\text{leaf}} \sum_{c} p_{\text{leaf}, c} \log p_{\text{leaf}, c}$$

… and we got entropy impurity (without minus sign, but that’s the only difference).