Gradient boosting¶
Gradient boosting is a general-purpose algorithm proposed by Friedman [GB]. It is one of the most efficient machine learning algorithms, used for classification, regression and ranking.
The key idea of the algorithm is iterative minimization of a target loss function: at every step one more estimator is trained and added to the sequence. In this implementation, decision trees are used as these estimators.
hep_ml provides non-standard loss functions for gradient boosting: there are, for instance, loss functions that fight correlation with chosen features, and loss functions for ranking. See hep_ml.losses for details.
See also the libraries XGBoost and sklearn.ensemble.GradientBoostingClassifier.
[GB] J.H. Friedman, ‘Greedy function approximation: A gradient boosting machine.’, 2001.
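A minimal usage sketch (the dataset and feature names here are made up for illustration; only classes documented on this page are used):

import numpy
import pandas
from hep_ml.gradientboosting import UGradientBoostingClassifier, LogLossFunction

# toy two-class dataset with made-up feature names
rng = numpy.random.RandomState(0)
X = pandas.DataFrame({'feature_1': rng.normal(size=1000),
                      'feature_2': rng.normal(size=1000)})
y = (X['feature_1'] + rng.normal(size=1000) > 0).astype(int)

classifier = UGradientBoostingClassifier(loss=LogLossFunction(), n_estimators=100,
                                         learning_rate=0.1, max_depth=3, subsample=0.7)
classifier.fit(X, y)
probabilities = classifier.predict_proba(X)  # numpy.array of shape [n_samples, 2]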
- class hep_ml.gradientboosting.AdaLossFunction(regularization=5.0)[source]¶
Bases: HessianLossFunction
AdaLossFunction is the same as the Exponential Loss Function (aka exploss).
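For orientation (the exact normalization follows the implementation): with signed labels \(s_i = +1\) for signal and \(s_i = -1\) for background, the exponential loss is conventionally written as \(\text{loss} = \sum_i w_i \exp(-s_i \cdot \text{pred}_i)\).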
- Parameters:
regularization – float, penalty for leaves with few events, corresponds roughly to the number of added events of both classes to each leaf.
- fit(X, y, sample_weight)[source]¶
This method is optional; it is called before all the others. Heavy preprocessing should be done here.
- hessian(y_pred)[source]¶
Returns the diagonal of the Hessian matrix.
- Parameters:
y_pred – numpy.array of shape [n_samples] with events passed in the same order as in fit.
- Returns:
numpy.array of shape [n_samples] with second derivatives with respect to each prediction.
- prepare_tree_params(y_pred)[source]¶
Prepares parameters for a regression tree that minimizes MSE.
- Parameters:
y_pred – predictions for all the events passed to the fit method, in the same order.
- Returns:
tuple (tree_target, tree_weight) with the target and weight to be used in the decision tree.
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → AdaLossFunction¶
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note: this method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- class hep_ml.gradientboosting.BinFlatnessLossFunction(uniform_features, uniform_label, n_bins=10, power=2.0, fl_coefficient=3.0, allow_wrong_signs=True)[source]¶
Bases: AbstractFlatnessLossFunction
This loss function contains separate penalties for non-flatness and for bad prediction quality. See [FL] for details.
\(\text{loss} = \text{ExpLoss} + c \times \text{FlatnessLoss}\)
FlatnessLoss is computed using binning of the uniform variables.
- Parameters:
uniform_features (list[str]) – names of features, along which we want to obtain uniformity of predictions
uniform_label (int|list[int]) – the label(s) of classes for which uniformity is desired
n_bins (int) – number of bins along each variable
power (float) – the loss contains the difference \(|F - F_\text{bin}|^p\), where p is this power
fl_coefficient (float) – multiplier for flatness_loss. Controls the tradeoff of quality vs uniformity.
allow_wrong_signs (bool) – defines whether the gradient may have a sign different from the “sign of the class” (i.e. may have a negative gradient on signal). If False, such values are clipped to zero.
[FL] A. Rogozhnikov et al., New approaches for boosting to uniformity, http://arxiv.org/abs/1410.4140
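A usage sketch (the ‘mass’ column, feature names and toy data are made up for illustration). The uniform feature is passed to the loss but excluded from train_features, so the trees never split on it while the loss still flattens predictions along it:

import numpy
import pandas
from hep_ml.gradientboosting import UGradientBoostingClassifier, BinFlatnessLossFunction

rng = numpy.random.RandomState(0)
data = pandas.DataFrame({'feature_1': rng.normal(size=2000),
                         'feature_2': rng.normal(size=2000),
                         'mass': rng.uniform(0., 1., size=2000)})
labels = rng.randint(0, 2, size=2000)

# uniformity of predictions for signal (label 1) is requested along 'mass'
loss = BinFlatnessLossFunction(uniform_features=['mass'], uniform_label=1,
                               n_bins=10, fl_coefficient=3.0)
classifier = UGradientBoostingClassifier(loss=loss, n_estimators=50,
                                         train_features=['feature_1', 'feature_2'])
classifier.fit(data, labels)  # data must contain 'mass' as well as train_features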
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → BinFlatnessLossFunction¶
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note: this method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- class hep_ml.gradientboosting.KnnAdaLossFunction(uniform_features, uniform_label, knn=10, row_norm=1.0)[source]¶
Bases: AbstractMatrixLossFunction
Modification of AdaLoss to achieve uniformity of predictions.
\(\text{loss} = \sum_i w_i \exp(-\sum_j a_{ij} y_j \cdot \text{score}_j)\)
The matrix A is square; each row corresponds to a single event of the training dataset. If the event belongs to a uniform class, its row contains ones at the positions of its closest neighbours. See [BU] for details.
- Parameters:
uniform_features (list[str]) – the features, along which uniformity is desired
uniform_label (int|list[int]) – the label (labels) of ‘uniform classes’
knn (int) – the number of nonzero elements in a row corresponding to an event of a ‘uniform class’
[BU] A. Rogozhnikov et al., New approaches for boosting to uniformity, http://arxiv.org/abs/1410.4140
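A construction sketch mirroring the flatness losses (the ‘mass’ column and feature names are placeholders):

from hep_ml.gradientboosting import UGradientBoostingClassifier, KnnAdaLossFunction

# uniformity along 'mass' for the signal class (label 1); each signal event's
# row of A gets ones at its 10 nearest neighbours in the space of uniform features
loss = KnnAdaLossFunction(uniform_features=['mass'], uniform_label=1, knn=10)
classifier = UGradientBoostingClassifier(loss=loss, n_estimators=50,
                                         train_features=['feature_1', 'feature_2'])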
- compute_parameters(trainX, trainY, trainW)[source]¶
This method should be overloaded in descendants and should return A, w (matrix and vector).
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → KnnAdaLossFunction¶
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note: this method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- class hep_ml.gradientboosting.KnnFlatnessLossFunction(uniform_features, uniform_label, n_neighbours=100, power=2.0, fl_coefficient=3.0, max_groups=5000, allow_wrong_signs=True, random_state=42)[source]¶
Bases: AbstractFlatnessLossFunction
This loss function contains separate penalties for non-flatness and for bad prediction quality. See [FL] for details.
\(\text{loss} = \text{ExpLoss} + c \times \text{FlatnessLoss}\)
FlatnessLoss is computed using nearest neighbours in the space of uniform features.
- Parameters:
uniform_features (list[str]) – names of features, along which we want to obtain uniformity of predictions
uniform_label (int|list[int]) – the label(s) of classes for which uniformity is desired
n_neighbours (int) – number of neighbors used in flatness loss
power (float) – the loss contains the difference \(|F - F_\text{bin}|^p\), where p is this power
fl_coefficient (float) – multiplier for flatness_loss. Controls the tradeoff of quality vs uniformity.
allow_wrong_signs (bool) – defines whether the gradient may have a sign different from the “sign of the class” (i.e. may have a negative gradient on signal). If False, such values are clipped to zero.
max_groups (int) – to limit memory consumption when the training sample is large, we randomly pick this number of points with their members.
[FL] A. Rogozhnikov et al., New approaches for boosting to uniformity, http://arxiv.org/abs/1410.4140
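The same pattern as the binned version above, with neighbour-based groups (names are placeholders):

from hep_ml.gradientboosting import UGradientBoostingClassifier, KnnFlatnessLossFunction

# flatness is evaluated over 100-nearest-neighbour groups in 'mass';
# max_groups bounds memory consumption on large samples
loss = KnnFlatnessLossFunction(uniform_features=['mass'], uniform_label=1,
                               n_neighbours=100, max_groups=5000)
classifier = UGradientBoostingClassifier(loss=loss, n_estimators=50,
                                         train_features=['feature_1', 'feature_2'])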
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → KnnFlatnessLossFunction¶
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note: this method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- class hep_ml.gradientboosting.LogLossFunction(regularization=5.0)[source]¶
Bases: HessianLossFunction
Logistic loss function (logloss), aka binomial deviance, aka cross-entropy, aka log-likelihood loss.
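For orientation (the exact normalization follows the implementation): with signed labels \(s_i \in \{-1, +1\}\), this loss is conventionally written as \(\text{loss} = \sum_i w_i \log(1 + \exp(-s_i \cdot \text{pred}_i))\).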
- Parameters:
regularization – float, penalty for leaves with few events, corresponds roughly to the number of added events of both classes to each leaf.
- fit(X, y, sample_weight)[source]¶
This method is optional; it is called before all the others. Heavy preprocessing should be done here.
- hessian(y_pred)[source]¶
Returns the diagonal of the Hessian matrix.
- Parameters:
y_pred – numpy.array of shape [n_samples] with events passed in the same order as in fit.
- Returns:
numpy.array of shape [n_samples] with second derivatives with respect to each prediction.
- prepare_tree_params(y_pred)[source]¶
Prepares parameters for a regression tree that minimizes MSE.
- Parameters:
y_pred – predictions for all the events passed to the fit method, in the same order.
- Returns:
tuple (tree_target, tree_weight) with the target and weight to be used in the decision tree.
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → LogLossFunction¶
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note: this method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- class hep_ml.gradientboosting.RankBoostLossFunction(request_column, penalty_power=1.0, update_iterations=1)[source]¶
Bases: HessianLossFunction
RankBoostLossFunction is the target of optimization in the RankBoost [RB] algorithm, which was developed for ranking and introduces penalties for wrongly ordered predictions.
This implementation goes further: optimal leaf values are selected by an iterative procedure. It also uses a matrix decomposition of the loss function, which is very effective when the labels come from a very limited set (usually 0, 1, 2, 3, 4).
\(\text{loss} = \sum_{ij} w_{ij} \exp(\text{pred}_i - \text{pred}_j)\),
\(w_{ij} = (\alpha + \beta \cdot [\text{query}_i = \text{query}_j]) \, R_{\text{label}_i, \text{label}_j}\), where \(R_{ij} = 0\) if \(i \leq j\), else \(R_{ij} = (i - j)^{p}\)
- Parameters:
request_column (str) – name of the column with search query ids. Higher attention is paid to pairs of samples within the same query.
penalty_power (float) – describes dependence of penalty on the difference between target labels.
update_iterations (int) – number of minimization steps to provide optimal values in leaves.
[RB] Freund et al., An Efficient Boosting Algorithm for Combining Preferences
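A construction sketch (column names are placeholders; pairing this loss with UGradientBoostingRegressor is an assumption here, since the classifier on this page supports only two classes while ranking labels are usually ordinal):

from hep_ml.gradientboosting import UGradientBoostingRegressor, RankBoostLossFunction

# 'query_id' groups documents of the same search query; relevance labels
# are assumed to come from a small ordinal set, e.g. 0..4
loss = RankBoostLossFunction(request_column='query_id', penalty_power=1.0)
ranker = UGradientBoostingRegressor(loss=loss, n_estimators=100,
                                    train_features=['feature_1', 'feature_2'])
# ranker.fit(data, relevance_labels)  -- data must contain the 'query_id' column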
- fit(X, y, sample_weight)[source]¶
This method is optional; it is called before all the others. Heavy preprocessing should be done here.
- hessian(y_pred)[source]¶
Returns the diagonal of the Hessian matrix.
- Parameters:
y_pred – numpy.array of shape [n_samples] with events passed in the same order as in fit.
- Returns:
numpy.array of shape [n_samples] with second derivatives with respect to each prediction.
- prepare_new_leaves_values(terminal_regions, leaf_values, y_pred)[source]¶
This expression comes from optimization of the second-order approximation of the loss function.
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → RankBoostLossFunction¶
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note: this method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- class hep_ml.gradientboosting.UGradientBoostingClassifier(loss=None, n_estimators=100, learning_rate=0.1, subsample=1.0, min_samples_split=2, min_samples_leaf=1, max_features=None, max_leaf_nodes=None, max_depth=3, splitter='best', update_tree=True, train_features=None, random_state=None)[source]¶
Bases: UGradientBoostingBase, ClassifierMixin
This version of gradient boosting supports only two-class classification and only special losses derived from AbstractLossFunction.
max_depth, max_leaf_nodes, min_samples_leaf, min_samples_split, max_features are parameters of regression tree, which is used as base estimator.
- Parameters:
loss (AbstractLossFunction) – any descendant of AbstractLossFunction; these are quite diverse. See hep_ml.losses for available losses.
n_estimators (int) – number of trained trees.
subsample (float) – fraction of data used at each stage
learning_rate (float) – size of the step.
update_tree (bool) – True by default. If False, the ‘improvement’ step after fitting a tree is skipped.
train_features – features used by the trees. Note that the algorithm may also require variables used by the loss function that are not listed here.
- fit(X, y, sample_weight=None)[source]¶
Trains the formula. Only two-class classification with labels 0 and 1 is supported.
- Parameters:
X – dataset of shape [n_samples, n_features]
y – labels, array-like of shape [n_samples]
sample_weight – array-like of shape [n_samples] or None
- Returns:
self
- predict(X)[source]¶
Predicted classes for each event
- Parameters:
X – pandas.DataFrame with all train_features
- Returns:
numpy.array of shape [n_samples] with predicted classes.
- predict_proba(X)[source]¶
Predicted probabilities for each event
- Parameters:
X – pandas.DataFrame with all train_features
- Returns:
numpy.array of shape [n_samples, n_classes]
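A round-trip sketch using the methods above (toy data, made-up feature names):

import numpy
import pandas
from hep_ml.gradientboosting import UGradientBoostingClassifier, AdaLossFunction

rng = numpy.random.RandomState(7)
X = pandas.DataFrame({'a': rng.normal(size=500), 'b': rng.normal(size=500)})
y = (X['a'] > 0).astype(int)

clf = UGradientBoostingClassifier(loss=AdaLossFunction(), n_estimators=50)
clf.fit(X, y, sample_weight=numpy.ones(len(X)))
classes = clf.predict(X)      # shape [n_samples], values 0 or 1
proba = clf.predict_proba(X)  # shape [n_samples, 2]; proba[:, 1] is P(class 1)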
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → UGradientBoostingClassifier¶
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note: this method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → UGradientBoostingClassifier¶
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note: this method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
- Returns:
self (object) – The updated object.
- class hep_ml.gradientboosting.UGradientBoostingRegressor(loss=None, n_estimators=100, learning_rate=0.1, subsample=1.0, min_samples_split=2, min_samples_leaf=1, max_features=None, max_leaf_nodes=None, max_depth=3, splitter='best', update_tree=True, train_features=None, random_state=None)[source]¶
Bases: UGradientBoostingBase, RegressorMixin
Gradient boosted regressor. Approximates the target by a sum of predictions of several trees.
max_depth, max_leaf_nodes, min_samples_leaf, min_samples_split, max_features are parameters of regression tree, which is used as base estimator.
- Parameters:
loss (AbstractLossFunction) – any descendant of AbstractLossFunction; these are quite diverse. See hep_ml.losses for available losses.
n_estimators (int) – number of trained trees.
subsample (float) – fraction of data used at each stage
learning_rate (float) – size of the step.
update_tree (bool) – True by default. If False, the ‘improvement’ step after fitting a tree is skipped.
train_features – features used by the trees. Note that the algorithm may also require variables used by the loss function that are not listed here.
- fit(X, y, sample_weight=None)[source]¶
Fit estimator.
- Parameters:
X – dataset of shape [n_samples, n_features]
y – target values, array-like of shape [n_samples]
sample_weight – array-like of shape [n_samples] or None
- Returns:
self
- predict(X)[source]¶
Predict values for new samples
- Parameters:
X – pandas.DataFrame with all train_features
- Returns:
numpy.array of shape [n_samples]
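A sketch of regression usage. No regression loss is documented in this section, so the example assumes a squared-error loss is available as hep_ml.losses.MSELossFunction (check hep_ml.losses for what actually exists):

import numpy
import pandas
from hep_ml.losses import MSELossFunction  # assumption: see hep_ml.losses
from hep_ml.gradientboosting import UGradientBoostingRegressor

rng = numpy.random.RandomState(1)
X = pandas.DataFrame({'x1': rng.uniform(size=500), 'x2': rng.uniform(size=500)})
y = X['x1'] + 0.1 * rng.normal(size=500)

regressor = UGradientBoostingRegressor(loss=MSELossFunction(), n_estimators=100,
                                       learning_rate=0.1, max_depth=3)
regressor.fit(X, y)
predictions = regressor.predict(X)  # numpy.array of shape [n_samples]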
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → UGradientBoostingRegressor¶
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note: this method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → UGradientBoostingRegressor¶
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note: this method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
- Returns:
self (object) – The updated object.