Gradient boosting¶
Gradient boosting is a general-purpose algorithm proposed by Friedman [GB]. It is one of the most efficient machine learning algorithms, used for classification, regression and ranking.
The key idea of the algorithm is iterative minimization of a target loss function: at every step one more estimator is trained and added to the sequence. In this implementation, decision trees are used as these estimators.
hep_ml provides non-standard loss functions for gradient boosting: there are, for instance, loss functions that fight correlation with chosen features, and loss functions for ranking. See hep_ml.losses for details.
See also the libraries XGBoost and sklearn.ensemble.GradientBoostingClassifier.
[GB] J.H. Friedman, ‘Greedy function approximation: A gradient boosting machine.’, 2001.
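A minimal usage sketch (the dataset and feature names here are made up for illustration; only classes documented on this page are used):

import numpy
import pandas
from hep_ml.gradientboosting import UGradientBoostingClassifier, LogLossFunction

# toy two-class dataset with made-up feature names
rng = numpy.random.RandomState(0)
X = pandas.DataFrame({'feature_1': rng.normal(size=1000),
                      'feature_2': rng.normal(size=1000)})
y = (X['feature_1'] + rng.normal(size=1000) > 0).astype(int)

classifier = UGradientBoostingClassifier(loss=LogLossFunction(), n_estimators=100,
                                         learning_rate=0.1, max_depth=3, subsample=0.7)
classifier.fit(X, y)
probabilities = classifier.predict_proba(X)  # numpy.array of shape [n_samples, 2]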
- class hep_ml.gradientboosting.AdaLossFunction(regularization=5.0)[source]¶
Bases: HessianLossFunction
AdaLossFunction is the same as the Exponential Loss Function (aka exploss).
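For orientation (the exact normalization follows the implementation): with signed labels \(s_i = +1\) for signal and \(s_i = -1\) for background, the exponential loss is conventionally written as \(\text{loss} = \sum_i w_i \exp(-s_i \cdot \text{pred}_i)\).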
- Parameters:
regularization – float, penalty for leaves with few events, corresponds roughly to the number of added events of both classes to each leaf.
- fit(X, y, sample_weight)[source]¶
This method is optional; it is called before all the others. Heavy preprocessing should be done here.
- hessian(y_pred)[source]¶
Returns the diagonal of the Hessian matrix.
- Parameters:
y_pred – numpy.array of shape [n_samples] with events passed in the same order as in fit.
- Returns:
numpy.array of shape [n_samples] with second derivatives with respect to each prediction.
- prepare_tree_params(y_pred)[source]¶
Prepares parameters for a regression tree that minimizes MSE.
- Parameters:
y_pred – predictions for all the events passed to the fit method, in the same order.
- Returns:
tuple (tree_target, tree_weight) with the target and weight to be used in the decision tree.
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → AdaLossFunction¶
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note: this method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- class hep_ml.gradientboosting.BinFlatnessLossFunction(uniform_features, uniform_label, n_bins=10, power=2.0, fl_coefficient=3.0, allow_wrong_signs=True)[source]¶
Bases: AbstractFlatnessLossFunction
This loss function contains separate penalties for non-flatness and for bad prediction quality. See [FL] for details.
\(\text{loss} = \text{ExpLoss} + c \times \text{FlatnessLoss}\)
FlatnessLoss is computed using binning of the uniform variables.
- Parameters:
uniform_features (list[str]) – names of features, along which we want to obtain uniformity of predictions
uniform_label (int|list[int]) – the label(s) of classes for which uniformity is desired
n_bins (int) – number of bins along each variable
power (float) – the loss contains the difference \(|F - F_\text{bin}|^p\), where p is this power
fl_coefficient (float) – multiplier for flatness_loss. Controls the tradeoff of quality vs uniformity.
allow_wrong_signs (bool) – defines whether the gradient may have a sign different from the “sign of the class” (i.e. may have a negative gradient on signal). If False, such values are clipped to zero.
[FL] A. Rogozhnikov et al., New approaches for boosting to uniformity, http://arxiv.org/abs/1410.4140
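A usage sketch (the ‘mass’ column, feature names and toy data are made up for illustration). The uniform feature is passed to the loss but excluded from train_features, so the trees never split on it while the loss still flattens predictions along it:

import numpy
import pandas
from hep_ml.gradientboosting import UGradientBoostingClassifier, BinFlatnessLossFunction

rng = numpy.random.RandomState(0)
data = pandas.DataFrame({'feature_1': rng.normal(size=2000),
                         'feature_2': rng.normal(size=2000),
                         'mass': rng.uniform(0., 1., size=2000)})
labels = rng.randint(0, 2, size=2000)

# uniformity of predictions for signal (label 1) is requested along 'mass'
loss = BinFlatnessLossFunction(uniform_features=['mass'], uniform_label=1,
                               n_bins=10, fl_coefficient=3.0)
classifier = UGradientBoostingClassifier(loss=loss, n_estimators=50,
                                         train_features=['feature_1', 'feature_2'])
classifier.fit(data, labels)  # data must contain 'mass' as well as train_features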
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → BinFlatnessLossFunction¶
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note: this method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- class hep_ml.gradientboosting.KnnAdaLossFunction(uniform_features, uniform_label, knn=10, row_norm=1.0)[source]¶
Bases: AbstractMatrixLossFunction
Modification of AdaLoss to achieve uniformity of predictions.
\(\text{loss} = \sum_i w_i \exp(-\sum_j a_{ij} y_j \cdot \text{score}_j)\)
The matrix A is square; each row corresponds to a single event of the training dataset. If the event belongs to a uniform class, its row contains ones at the positions of its closest neighbours. See [BU] for details.
- Parameters:
uniform_features (list[str]) – the features, along which uniformity is desired
uniform_label (int|list[int]) – the label (labels) of ‘uniform classes’
knn (int) – the number of nonzero elements in a row corresponding to an event of a ‘uniform class’
[BU] A. Rogozhnikov et al., New approaches for boosting to uniformity, http://arxiv.org/abs/1410.4140
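A construction sketch mirroring the flatness losses (the ‘mass’ column and feature names are placeholders):

from hep_ml.gradientboosting import UGradientBoostingClassifier, KnnAdaLossFunction

# uniformity along 'mass' for the signal class (label 1); each signal event's
# row of A gets ones at its 10 nearest neighbours in the space of uniform features
loss = KnnAdaLossFunction(uniform_features=['mass'], uniform_label=1, knn=10)
classifier = UGradientBoostingClassifier(loss=loss, n_estimators=50,
                                         train_features=['feature_1', 'feature_2'])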
- compute_parameters(trainX, trainY, trainW)[source]¶
This method should be overloaded in descendants and should return A, w (matrix and vector).
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → KnnAdaLossFunction¶
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note: this method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- class hep_ml.gradientboosting.KnnFlatnessLossFunction(uniform_features, uniform_label, n_neighbours=100, power=2.0, fl_coefficient=3.0, max_groups=5000, allow_wrong_signs=True, random_state=42)[source]¶
Bases: AbstractFlatnessLossFunction
This loss function contains separate penalties for non-flatness and for bad prediction quality. See [FL] for details.
\(\text{loss} = \text{ExpLoss} + c \times \text{FlatnessLoss}\)
FlatnessLoss is computed using nearest neighbours in the space of uniform features.
- Parameters:
uniform_features (list[str]) – names of features, along which we want to obtain uniformity of predictions
uniform_label (int|list[int]) – the label(s) of classes for which uniformity is desired
n_neighbours (int) – number of neighbors used in flatness loss
power (float) – the loss contains the difference \(|F - F_\text{bin}|^p\), where p is this power
fl_coefficient (float) – multiplier for flatness_loss. Controls the tradeoff of quality vs uniformity.
allow_wrong_signs (bool) – defines whether the gradient may have a sign different from the “sign of the class” (i.e. may have a negative gradient on signal). If False, such values are clipped to zero.
max_groups (int) – to limit memory consumption when the training sample is large, we randomly pick this number of points with their members.
[FL] A. Rogozhnikov et al., New approaches for boosting to uniformity, http://arxiv.org/abs/1410.4140
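The same pattern as the binned version above, with neighbour-based groups (names are placeholders):

from hep_ml.gradientboosting import UGradientBoostingClassifier, KnnFlatnessLossFunction

# flatness is evaluated over 100-nearest-neighbour groups in 'mass';
# max_groups bounds memory consumption on large samples
loss = KnnFlatnessLossFunction(uniform_features=['mass'], uniform_label=1,
                               n_neighbours=100, max_groups=5000)
classifier = UGradientBoostingClassifier(loss=loss, n_estimators=50,
                                         train_features=['feature_1', 'feature_2'])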
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → KnnFlatnessLossFunction¶
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note: this method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- class hep_ml.gradientboosting.LogLossFunction(regularization=5.0)[source]¶
Bases: HessianLossFunction
Logistic loss function (logloss), aka binomial deviance, aka cross-entropy, aka log-likelihood loss.
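For orientation (the exact normalization follows the implementation): with signed labels \(s_i \in \{-1, +1\}\), this loss is conventionally written as \(\text{loss} = \sum_i w_i \log(1 + \exp(-s_i \cdot \text{pred}_i))\).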
- Parameters:
regularization – float, penalty for leaves with few events, corresponds roughly to the number of added events of both classes to each leaf.
- fit(X, y, sample_weight)[source]¶
This method is optional; it is called before all the others. Heavy preprocessing should be done here.
- hessian(y_pred)[source]¶
Returns the diagonal of the Hessian matrix.
- Parameters:
y_pred – numpy.array of shape [n_samples] with events passed in the same order as in fit.
- Returns:
numpy.array of shape [n_samples] with second derivatives with respect to each prediction.
- prepare_tree_params(y_pred)[source]¶
Prepares parameters for a regression tree that minimizes MSE.
- Parameters:
y_pred – predictions for all the events passed to the fit method, in the same order.
- Returns:
tuple (tree_target, tree_weight) with the target and weight to be used in the decision tree.
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → LogLossFunction¶
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note: this method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- class hep_ml.gradientboosting.RankBoostLossFunction(request_column, penalty_power=1.0, update_iterations=1)[source]¶
Bases: HessianLossFunction
RankBoostLossFunction is the target of optimization in the RankBoost [RB] algorithm, which was developed for ranking and introduces penalties for wrongly ordered predictions.
This implementation goes further: optimal leaf values are selected by an iterative procedure. It also uses a matrix decomposition of the loss function, which is very effective when the labels come from a very limited set (usually 0, 1, 2, 3, 4).
\(\text{loss} = \sum_{ij} w_{ij} \exp(\text{pred}_i - \text{pred}_j)\),
\(w_{ij} = (\alpha + \beta \cdot [\text{query}_i = \text{query}_j]) \, R_{\text{label}_i, \text{label}_j}\), where \(R_{ij} = 0\) if \(i \leq j\), else \(R_{ij} = (i - j)^{p}\)
- Parameters:
request_column (str) – name of the column with search query ids. Higher attention is paid to pairs of samples within the same query.
penalty_power (float) – describes dependence of penalty on the difference between target labels.
update_iterations (int) – number of minimization steps to provide optimal values in leaves.
[RB] Freund et al., An Efficient Boosting Algorithm for Combining Preferences
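A construction sketch (column names are placeholders; pairing this loss with UGradientBoostingRegressor is an assumption here, since the classifier on this page supports only two classes while ranking labels are usually ordinal):

from hep_ml.gradientboosting import UGradientBoostingRegressor, RankBoostLossFunction

# 'query_id' groups documents of the same search query; relevance labels
# are assumed to come from a small ordinal set, e.g. 0..4
loss = RankBoostLossFunction(request_column='query_id', penalty_power=1.0)
ranker = UGradientBoostingRegressor(loss=loss, n_estimators=100,
                                    train_features=['feature_1', 'feature_2'])
# ranker.fit(data, relevance_labels)  -- data must contain the 'query_id' column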
- fit(X, y, sample_weight)[source]¶
This method is optional; it is called before all the others. Heavy preprocessing should be done here.
- hessian(y_pred)[source]¶
Returns the diagonal of the Hessian matrix.
- Parameters:
y_pred – numpy.array of shape [n_samples] with events passed in the same order as in fit.
- Returns:
numpy.array of shape [n_samples] with second derivatives with respect to each prediction.
- prepare_new_leaves_values(terminal_regions, leaf_values, y_pred)[source]¶
This expression comes from optimization of the second-order approximation of the loss function.
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → RankBoostLossFunction¶
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note: this method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- class hep_ml.gradientboosting.UGradientBoostingClassifier(loss=None, n_estimators=100, learning_rate=0.1, subsample=1.0, min_samples_split=2, min_samples_leaf=1, max_features=None, max_leaf_nodes=None, max_depth=3, splitter='best', update_tree=True, train_features=None, random_state=None)[source]¶
Bases: UGradientBoostingBase, ClassifierMixin
This version of gradient boosting supports only two-class classification and only special losses derived from AbstractLossFunction.
max_depth, max_leaf_nodes, min_samples_leaf, min_samples_split, max_features are parameters of regression tree, which is used as base estimator.
- Parameters:
loss (AbstractLossFunction) – any descendant of AbstractLossFunction; these are quite diverse. See hep_ml.losses for available losses.
n_estimators (int) – number of trained trees.
subsample (float) – fraction of data used at each stage
learning_rate (float) – size of the step.
update_tree (bool) – True by default. If False, the ‘improvement’ step after fitting a tree is skipped.
train_features – features used by the trees. Note that the algorithm may also require variables used by the loss function that are not listed here.
- fit(X, y, sample_weight=None)[source]¶
Trains the formula. Only two-class classification with labels 0 and 1 is supported.
- Parameters:
X – dataset of shape [n_samples, n_features]
y – labels, array-like of shape [n_samples]
sample_weight – array-like of shape [n_samples] or None
- Returns:
self
- predict(X)[source]¶
Predicted classes for each event
- Parameters:
X – pandas.DataFrame with all train_features
- Returns:
numpy.array of shape [n_samples] with predicted classes.
- predict_proba(X)[source]¶
Predicted probabilities for each event
- Parameters:
X – pandas.DataFrame with all train_features
- Returns:
numpy.array of shape [n_samples, n_classes]
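A round-trip sketch using the methods above (toy data, made-up feature names):

import numpy
import pandas
from hep_ml.gradientboosting import UGradientBoostingClassifier, AdaLossFunction

rng = numpy.random.RandomState(7)
X = pandas.DataFrame({'a': rng.normal(size=500), 'b': rng.normal(size=500)})
y = (X['a'] > 0).astype(int)

clf = UGradientBoostingClassifier(loss=AdaLossFunction(), n_estimators=50)
clf.fit(X, y, sample_weight=numpy.ones(len(X)))
classes = clf.predict(X)      # shape [n_samples], values 0 or 1
proba = clf.predict_proba(X)  # shape [n_samples, 2]; proba[:, 1] is P(class 1)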
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → UGradientBoostingClassifier¶
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note: this method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → UGradientBoostingClassifier¶
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note: this method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
- Returns:
self (object) – The updated object.
- class hep_ml.gradientboosting.UGradientBoostingRegressor(loss=None, n_estimators=100, learning_rate=0.1, subsample=1.0, min_samples_split=2, min_samples_leaf=1, max_features=None, max_leaf_nodes=None, max_depth=3, splitter='best', update_tree=True, train_features=None, random_state=None)[source]¶
Bases: UGradientBoostingBase, RegressorMixin
Gradient boosted regressor. Approximates the target by a sum of predictions of several trees.
max_depth, max_leaf_nodes, min_samples_leaf, min_samples_split, max_features are parameters of regression tree, which is used as base estimator.
- Parameters:
loss (AbstractLossFunction) – any descendant of AbstractLossFunction; these are quite diverse. See hep_ml.losses for available losses.
n_estimators (int) – number of trained trees.
subsample (float) – fraction of data used at each stage
learning_rate (float) – size of the step.
update_tree (bool) – True by default. If False, the ‘improvement’ step after fitting a tree is skipped.
train_features – features used by the trees. Note that the algorithm may also require variables used by the loss function that are not listed here.
- fit(X, y, sample_weight=None)[source]¶
Fit estimator.
- Parameters:
X – dataset of shape [n_samples, n_features]
y – target values, array-like of shape [n_samples]
sample_weight – array-like of shape [n_samples] or None
- Returns:
self
- predict(X)[source]¶
Predict values for new samples
- Parameters:
X – pandas.DataFrame with all train_features
- Returns:
numpy.array of shape [n_samples]
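A sketch of regression usage. No regression loss is documented in this section, so the example assumes a squared-error loss is available as hep_ml.losses.MSELossFunction (check hep_ml.losses for what actually exists):

import numpy
import pandas
from hep_ml.losses import MSELossFunction  # assumption: see hep_ml.losses
from hep_ml.gradientboosting import UGradientBoostingRegressor

rng = numpy.random.RandomState(1)
X = pandas.DataFrame({'x1': rng.uniform(size=500), 'x2': rng.uniform(size=500)})
y = X['x1'] + 0.1 * rng.normal(size=500)

regressor = UGradientBoostingRegressor(loss=MSELossFunction(), n_estimators=100,
                                       learning_rate=0.1, max_depth=3)
regressor.fit(X, y)
predictions = regressor.predict(X)  # numpy.array of shape [n_samples]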
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → UGradientBoostingRegressor¶
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note: this method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → UGradientBoostingRegressor¶
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note: this method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
- Returns:
self (object) – The updated object.