Gradient boosting
Gradient boosting is a general-purpose algorithm proposed by Friedman [GB]. It is one of the most effective machine learning algorithms for classification, regression and ranking.
The key idea of the algorithm is iterative minimization of a target loss function: at each stage one more estimator is trained and added to the sequence. In this implementation, decision trees are used as base estimators.
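As a schematic illustration of this idea (not the library's actual implementation), each stage fits a regression tree to the negative gradient of the loss at the current predictions and adds it to the ensemble, scaled by a learning rate; the loss_gradient callable below is an assumption of the sketch:

    import numpy
    from sklearn.tree import DecisionTreeRegressor

    def boosting_sketch(X, y, loss_gradient, n_estimators=100, learning_rate=0.1):
        # Schematic boosting loop; a real implementation would also
        # initialize predictions with a constant and may refit leaf values.
        pred = numpy.zeros(len(y))                 # current ensemble prediction
        trees = []
        for _ in range(n_estimators):
            residuals = -loss_gradient(y, pred)    # pseudo-residuals
            tree = DecisionTreeRegressor(max_depth=3)
            tree.fit(X, residuals)                 # fit tree to pseudo-residuals
            pred += learning_rate * tree.predict(X)
            trees.append(tree)
        return trees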
hep_ml provides non-standard loss functions for gradient boosting: for instance, loss functions that fight correlation with chosen variables, and loss functions for ranking. See hep_ml.losses for details.
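For example, a minimal sketch of training with a flatness loss (the column names 'mass' and 'pt' and the toy data are placeholders, not part of the library):

    import numpy
    import pandas
    from hep_ml.gradientboosting import UGradientBoostingClassifier
    from hep_ml.losses import BinFlatnessLossFunction

    # toy dataset: 'mass' is the variable predictions should be uniform in
    data = pandas.DataFrame({'mass': numpy.random.uniform(0, 1, 1000),
                             'pt': numpy.random.exponential(1, 1000)})
    labels = numpy.random.randint(0, 2, 1000)

    # penalize non-uniformity of the signal (label 1) efficiency along 'mass'
    loss = BinFlatnessLossFunction(uniform_features=['mass'], uniform_label=1)
    clf = UGradientBoostingClassifier(loss=loss, n_estimators=50,
                                      train_features=['pt'])
    clf.fit(data, labels)  # 'mass' is used by the loss, not by the trees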
See also libraries: XGBoost, sklearn.ensemble.GradientBoostingClassifier
- [GB] J.H. Friedman, 'Greedy function approximation: A gradient boosting machine', The Annals of Statistics, 2001.
- class hep_ml.gradientboosting.UGradientBoostingClassifier(loss=None, n_estimators=100, learning_rate=0.1, subsample=1.0, min_samples_split=2, min_samples_leaf=1, max_features=None, max_leaf_nodes=None, max_depth=3, splitter='best', update_tree=True, train_features=None, random_state=None)
Bases: hep_ml.gradientboosting.UGradientBoostingBase, sklearn.base.ClassifierMixin
This version of gradient boosting supports only two-class classification and only the special loss functions derived from AbstractLossFunction.
max_depth, max_leaf_nodes, min_samples_leaf, min_samples_split and max_features are parameters of the regression trees used as base estimators.
- Parameters
loss (AbstractLossFunction) – any descendant of AbstractLossFunction; see hep_ml.losses for available losses.
n_estimators (int) – number of trees to train.
subsample (float) – fraction of the data used at each stage.
learning_rate (float) – size of the gradient step.
update_tree (bool) – True by default. If False, the ‘improvement’ step after fitting each tree is skipped.
train_features – features used by the trees. Note that the algorithm may also require variables used by the loss function that are not listed here.
- fit(X, y, sample_weight=None)
Train the classifier. Only two-class classification with labels 0 and 1 is supported.
- Parameters
X – dataset of shape [n_samples, n_features]
y – labels, array-like of shape [n_samples]
sample_weight – array-like of shape [n_samples] or None
- Returns
self
- predict(X)
Predict the class for each event.
- Parameters
X – pandas.DataFrame with all train_features
- Returns
numpy.array of shape [n_samples] with predicted classes.
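An end-to-end sketch of the classifier interface described above, using toy data and LogLossFunction, one of the losses in hep_ml.losses:

    import numpy
    import pandas
    from hep_ml.gradientboosting import UGradientBoostingClassifier
    from hep_ml.losses import LogLossFunction

    X = pandas.DataFrame({'a': numpy.random.normal(size=200),
                          'b': numpy.random.normal(size=200)})
    y = (X['a'] + X['b'] > 0).astype(int)   # labels 0 and 1

    clf = UGradientBoostingClassifier(loss=LogLossFunction(),
                                      n_estimators=100, learning_rate=0.1)
    clf.fit(X, y)
    predicted = clf.predict(X)              # numpy array of shape [n_samples]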
- class hep_ml.gradientboosting.UGradientBoostingRegressor(loss=None, n_estimators=100, learning_rate=0.1, subsample=1.0, min_samples_split=2, min_samples_leaf=1, max_features=None, max_leaf_nodes=None, max_depth=3, splitter='best', update_tree=True, train_features=None, random_state=None)
Bases: hep_ml.gradientboosting.UGradientBoostingBase, sklearn.base.RegressorMixin
Gradient boosted regressor. Approximates the target as a sum of predictions of several trees.
max_depth, max_leaf_nodes, min_samples_leaf, min_samples_split and max_features are parameters of the regression trees used as base estimators.
- Parameters
loss (AbstractLossFunction) – any descendant of AbstractLossFunction; see hep_ml.losses for available losses.
n_estimators (int) – number of trees to train.
subsample (float) – fraction of the data used at each stage.
learning_rate (float) – size of the gradient step.
update_tree (bool) – True by default. If False, the ‘improvement’ step after fitting each tree is skipped.
train_features – features used by the trees. Note that the algorithm may also require variables used by the loss function that are not listed here.
- fit(X, y, sample_weight=None)
Fit estimator.
- Parameters
X – dataset of shape [n_samples, n_features]
y – target values, array-like of shape [n_samples]
sample_weight – array-like of shape [n_samples] or None
- Returns
self
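A similar sketch for the regressor, assuming MSELossFunction from hep_ml.losses and invented feature names:

    import numpy
    import pandas
    from hep_ml.gradientboosting import UGradientBoostingRegressor
    from hep_ml.losses import MSELossFunction

    X = pandas.DataFrame({'x1': numpy.random.normal(size=500),
                          'x2': numpy.random.normal(size=500)})
    y = X['x1'] ** 2 + 0.1 * numpy.random.normal(size=500)

    reg = UGradientBoostingRegressor(loss=MSELossFunction(),
                                     n_estimators=100, max_depth=3)
    reg.fit(X, y)
    predictions = reg.predict(X)   # sum of the trees' predictions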