Gradient boosting

Gradient boosting is a general-purpose algorithm proposed by Friedman [GB]. It is one of the most effective machine learning algorithms, used for classification, regression and ranking.

The key idea of the algorithm is iterative minimization of a target loss function: at each step, one more estimator is trained and added to the sequence. In this implementation, decision trees are used as the base estimators.
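
For intuition, here is a minimal sketch of the boosting loop (not the hep_ml implementation; the loss object and its negative_gradient method are illustrative assumptions):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def boosting_sketch(X, y, loss, n_estimators=100, learning_rate=0.1, max_depth=3):
        # Each tree is fitted to the negative gradient of the loss
        # at the current prediction (the "pseudo-residuals").
        pred = np.zeros(len(y))                           # F_0(x) = 0
        trees = []
        for _ in range(n_estimators):
            residuals = loss.negative_gradient(y, pred)   # hypothetical API
            tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
            pred += learning_rate * tree.predict(X)       # shrunken update
            trees.append(tree)
        return trees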

hep_ml provides non-standard loss functions for gradient boosting: for instance, loss functions that fight correlation, or loss functions for ranking. See hep_ml.losses for details.
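
For example, one can train a classifier whose output is uniform in a chosen variable by passing a flatness loss. This sketch assumes BinFlatnessLossFunction from hep_ml.losses and hypothetical feature names (check hep_ml.losses for the exact constructor signature):

    from hep_ml.gradientboosting import UGradientBoostingClassifier
    from hep_ml.losses import BinFlatnessLossFunction

    # Penalize non-uniformity of predictions along 'mass' for class 0;
    # the feature names here are hypothetical.
    loss = BinFlatnessLossFunction(uniform_features=['mass'], uniform_label=0)
    clf = UGradientBoostingClassifier(loss=loss, n_estimators=100,
                                      train_features=['pt', 'eta'])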

See also libraries: XGBoost, sklearn.ensemble.GradientBoostingClassifier

GB

J.H. Friedman, ‘Greedy function approximation: A gradient boosting machine’, Annals of Statistics 29(5), 2001.

class hep_ml.gradientboosting.UGradientBoostingClassifier(loss=None, n_estimators=100, learning_rate=0.1, subsample=1.0, min_samples_split=2, min_samples_leaf=1, max_features=None, max_leaf_nodes=None, max_depth=3, splitter='best', update_tree=True, train_features=None, random_state=None)[source]

Bases: hep_ml.gradientboosting.UGradientBoostingBase, sklearn.base.ClassifierMixin

This version of gradient boosting supports only two-class classification and only special losses derived from AbstractLossFunction.

max_depth, max_leaf_nodes, min_samples_leaf, min_samples_split and max_features are parameters of the regression tree, which is used as the base estimator.

Parameters
  • loss (AbstractLossFunction) – any descendant of AbstractLossFunction; many different losses are available. See hep_ml.losses for the full list.

  • n_estimators (int) – number of trained trees.

  • subsample (float) – fraction of the training data used at each stage.

  • learning_rate (float) – size of the step (shrinkage applied to each tree’s contribution).

  • update_tree (bool) – True by default. If False, the ‘improvement’ step (updating tree leaf values after fitting) is skipped.

  • train_features – features used by the trees. Note that the algorithm may also require variables used by the loss function that are not listed here; these must still be present in the data passed to fit.
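
A minimal end-to-end sketch of training and prediction (the toy data and the LogLossFunction import are illustrative; see hep_ml.losses for available losses):

    import numpy as np
    import pandas as pd
    from hep_ml.gradientboosting import UGradientBoostingClassifier
    from hep_ml.losses import LogLossFunction

    # Toy dataset with two features and binary labels 0/1.
    rng = np.random.RandomState(42)
    X = pd.DataFrame({'pt': rng.exponential(size=1000),
                      'eta': rng.normal(size=1000)})
    y = rng.randint(0, 2, size=1000)

    clf = UGradientBoostingClassifier(loss=LogLossFunction(),
                                      n_estimators=50, learning_rate=0.1)
    clf.fit(X, y)
    labels = clf.predict(X)         # shape [n_samples]
    proba = clf.predict_proba(X)    # shape [n_samples, 2]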

fit(X, y, sample_weight=None)[source]

Train the classifier. Only two-class (binary) classification with labels 0 and 1 is supported.

Parameters
  • X – dataset of shape [n_samples, n_features]

  • y – labels, array-like of shape [n_samples]

  • sample_weight – array-like of shape [n_samples] or None

Returns

self

predict(X)[source]

Predicted classes for each event

Parameters

X – pandas.DataFrame with all train_features

Returns

numpy.array of shape [n_samples] with predicted classes.

predict_proba(X)[source]

Predicted probabilities for each event

Parameters

X – pandas.DataFrame with all train_features

Returns

numpy.array of shape [n_samples, n_classes]

staged_predict_proba(X)[source]

Predicted probabilities for each event after each stage of boosting

Parameters

X – pandas.DataFrame with all train_features

Returns

sequence of numpy.array of shape [n_samples, n_classes]
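
Staged probabilities make it easy to pick n_estimators by monitoring a held-out metric after each tree; a sketch (clf, X_test and y_test are assumed to come from an earlier fit):

    from sklearn.metrics import roc_auc_score

    # ROC AUC after each boosting stage; clf, X_test, y_test assumed above.
    stage_aucs = [roc_auc_score(y_test, proba[:, 1])
                  for proba in clf.staged_predict_proba(X_test)]
    best_n_trees = 1 + max(range(len(stage_aucs)), key=stage_aucs.__getitem__)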

class hep_ml.gradientboosting.UGradientBoostingRegressor(loss=None, n_estimators=100, learning_rate=0.1, subsample=1.0, min_samples_split=2, min_samples_leaf=1, max_features=None, max_leaf_nodes=None, max_depth=3, splitter='best', update_tree=True, train_features=None, random_state=None)[source]

Bases: hep_ml.gradientboosting.UGradientBoostingBase, sklearn.base.RegressorMixin

Gradient boosted regressor. Approximates the target as a sum of the predictions of several trees.

max_depth, max_leaf_nodes, min_samples_leaf, min_samples_split and max_features are parameters of the regression tree, which is used as the base estimator.

Parameters
  • loss (AbstractLossFunction) – any descendant of AbstractLossFunction; many different losses are available. See hep_ml.losses for the full list.

  • n_estimators (int) – number of trained trees.

  • subsample (float) – fraction of the training data used at each stage.

  • learning_rate (float) – size of the step (shrinkage applied to each tree’s contribution).

  • update_tree (bool) – True by default. If False, the ‘improvement’ step (updating tree leaf values after fitting) is skipped.

  • train_features – features used by the trees. Note that the algorithm may also require variables used by the loss function that are not listed here; these must still be present in the data passed to fit.
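
A minimal regression sketch (toy data; MSELossFunction is assumed to be available in hep_ml.losses, verify the name in your version):

    import numpy as np
    import pandas as pd
    from hep_ml.gradientboosting import UGradientBoostingRegressor
    from hep_ml.losses import MSELossFunction

    # Toy regression problem with a smooth target plus noise.
    rng = np.random.RandomState(0)
    X = pd.DataFrame({'x1': rng.uniform(size=500),
                      'x2': rng.normal(size=500)})
    y = np.sin(3 * X['x1'].values) + 0.1 * rng.normal(size=500)

    reg = UGradientBoostingRegressor(loss=MSELossFunction(), n_estimators=100)
    reg.fit(X, y)
    pred = reg.predict(X)   # numpy array of shape [n_samples]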

fit(X, y, sample_weight=None)[source]

Fit estimator.

Parameters
  • X – dataset of shape [n_samples, n_features]

  • y – target values, array-like of shape [n_samples]

  • sample_weight – array-like of shape [n_samples] or None

Returns

self

predict(X)[source]

Predict values for new samples

Parameters

X – pandas.DataFrame with all train_features

Returns

numpy.array of shape [n_samples]

staged_predict(X)[source]

Return predictions after each new tree

Parameters

X – pandas.DataFrame with all train_features

Returns

sequence of numpy.array of shape [n_samples]
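
As with the classifier, staged predictions let one track the error as trees are added; a sketch assuming reg, X and y from the regressor example above:

    from sklearn.metrics import mean_squared_error

    # Mean squared error after each boosting stage.
    stage_mse = [mean_squared_error(y, pred)
                 for pred in reg.staged_predict(X)]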