uBoost¶
This module contains an implementation of the uBoost algorithm. The main goal of uBoost is to fight the correlation between predictions and some variables (e.g. the mass of a particle).
uBoostBDT is a modified version of AdaBoost that aims to obtain uniform efficiency at a specified level of global efficiency.
uBoostClassifier is a combination of uBoostBDTs trained for different target efficiencies.
This implementation is more advanced than the one described in the original paper: it adds smoothing, trains classifiers in threads, exposes learning_rate and uniforming_rate parameters, renormalizes weights automatically, and supports the SAMME.R modification, which uses predicted probabilities.
Only binary classification is implemented.
See also: hep_ml.losses.BinFlatnessLossFunction, hep_ml.losses.KnnFlatnessLossFunction and hep_ml.losses.KnnAdaLossFunction, which also fight correlation.
Examples¶
To get predictions uniform in mass for background:
>>> from sklearn.tree import DecisionTreeClassifier
>>> from hep_ml.uboost import uBoostClassifier
>>> base_tree = DecisionTreeClassifier(max_depth=3)
>>> clf = uBoostClassifier(uniform_features=['mass'], uniform_label=0, base_estimator=base_tree,
...                        train_features=['pt', 'flight_time'])
>>> clf.fit(train_data, train_labels, sample_weight=train_weights)
>>> proba = clf.predict_proba(test_data)
To get predictions uniform in the Dalitz variables for signal:
>>> clf = uBoostClassifier(uniform_features=['mass_12', 'mass_23'], uniform_label=1, base_estimator=base_tree,
...                        train_features=['pt', 'flight_time'])
>>> clf.fit(train_data, train_labels, sample_weight=train_weights)
>>> proba = clf.predict_proba(test_data)
- class hep_ml.uboost.uBoostBDT(uniform_features, uniform_label, target_efficiency=0.5, n_neighbors=50, subsample=1.0, base_estimator=None, n_estimators=50, learning_rate=1.0, uniforming_rate=1.0, train_features=None, smoothing=0.0, random_state=None, algorithm='SAMME')[source]¶
Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin
uBoostBDT is an AdaBoostClassifier modified to have flat signal (class=1) efficiency along some variables. The efficiency is only guaranteed at the cut corresponding to global efficiency == target_efficiency.
Can be used alone, without uBoostClassifier.
- Parameters
uniform_features – list of strings, names of variables along which flatness is desired
uniform_label – int, label of the class on which uniformity is desired (typically 0 for background, 1 for signal).
target_efficiency – float, flatness is obtained at the global BDT cut corresponding to this global efficiency
n_neighbors – int (default=50), the number of neighbours used to compute local efficiency
subsample – float (default=1.0), part of the training dataset used to build each base estimator.
base_estimator – classifier, optional (default=DecisionTreeClassifier(max_depth=2)) The base estimator from which the boosted ensemble is built. Support for sample weighting is required, as well as proper classes_ and n_classes_ attributes.
n_estimators – integer, optional (default=50) number of estimators used.
learning_rate – float, optional (default=1.) Learning rate shrinks the contribution of each classifier by learning_rate. There is a trade-off between learning_rate and n_estimators.
uniforming_rate – float, optional (default=1.) how much the uniformity of signal is taken into account; there is a trade-off between uniforming_rate and the speed of uniforming. A zero value corresponds to plain AdaBoost.
train_features – list of strings, names of variables used in fit/predict. If None, all variables are used (including uniform_features)
smoothing – float (default=0.), used to smooth the computation of local efficiencies; 0.0 corresponds to the usual uBoost
random_state – int, RandomState instance or None (default=None)
[1] J. Stevens, M. Williams, ‘uBoost: A boosting method for producing uniform selection efficiencies from multivariate classifiers’
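The roles of target_efficiency and n_neighbors can be sketched with plain numpy (an illustration with synthetic scores and a made-up neighbours matrix, not hep_ml internals):

```python
import numpy as np

rng = np.random.RandomState(0)
scores = rng.normal(size=1000)      # hypothetical BDT scores of signal events
target_efficiency = 0.5

# The global cut is the score quantile that keeps target_efficiency of signal:
cut = np.percentile(scores, 100 * (1 - target_efficiency))
global_eff = np.mean(scores > cut)   # close to target_efficiency

# The local efficiency of an event is the fraction of its n_neighbors nearest
# signal neighbours (in the uniform features) that pass the same cut.
n_neighbors = 50
neighbours = rng.randint(0, 1000, size=(1000, n_neighbors))  # stand-in knn matrix
local_eff = np.mean(scores[neighbours] > cut, axis=1)
# uBoost upweights events whose local efficiency deviates from the global one.
```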
- decision_function(X)[source]¶
Decision function: a float for each sample; the greater the value, the more signal-like the event.
- Parameters
X – data, pandas.DataFrame of shape [n_samples, n_features]
- Returns
array of shape [n_samples] with floats
- property feature_importances_¶
Return the feature importances for train_features.
- Returns
array of shape [n_features], the order is the same as in train_features
- fit(X, y, sample_weight=None, neighbours_matrix=None)[source]¶
Build a boosted classifier from the training set (X, y).
- Parameters
X – array-like of shape [n_samples, n_features]
y – labels, array of shape [n_samples] with 0 and 1.
sample_weight – array-like of shape [n_samples] or None
neighbours_matrix – array-like of shape [n_samples, n_neighbours]; each row contains indices of signal neighbours (neighbours should be computed for background events too). If None, this matrix is computed.
- Returns
self
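If you want to supply neighbours_matrix yourself, it can be built with k-nearest neighbours in the uniform features. A minimal numpy sketch (the data here are synthetic, and hep_ml's own construction may differ in details):

```python
import numpy as np

rng = np.random.RandomState(42)
n_samples, n_neighbours = 200, 10
mass = rng.uniform(0, 1, size=(n_samples, 1))  # synthetic uniform feature

# Pairwise distances in the uniform feature; each row of the resulting matrix
# lists the indices of the event's nearest neighbours (including itself).
distances = np.abs(mass - mass.T)
neighbours_matrix = np.argsort(distances, axis=1)[:, :n_neighbours]
```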
- predict(X)[source]¶
Predict classes for each sample
- Parameters
X – data, pandas.DataFrame of shape [n_samples, n_features]
- Returns
array of shape [n_samples] with predicted classes.
- predict_proba(X)[source]¶
Predict probabilities
- Parameters
X – data, pandas.DataFrame of shape [n_samples, n_features]
- Returns
array of shape [n_samples, n_classes] with probabilities.
- class hep_ml.uboost.uBoostClassifier(uniform_features, uniform_label, train_features=None, n_neighbors=50, efficiency_steps=20, n_estimators=40, base_estimator=None, subsample=1.0, algorithm='SAMME', smoothing=None, n_threads=1, random_state=None)[source]¶
Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin
uBoost classifier, a boosting algorithm aimed at obtaining flat signal efficiency along some variables (e.g. mass).
In essence, uBoost is an ensemble of uBoostBDTs. See [1] for details.
- Parameters
uniform_features – list of strings, names of variables along which flatness is desired
uniform_label – int, the label of the class for which uniformity is desired
train_features – list of strings, names of variables used in fit/predict. If None, all variables are used (including uniform_features)
n_neighbors – int (default=50), the number of neighbours used to compute local efficiency
n_estimators – integer, optional (default=40) The maximum number of estimators at which boosting is terminated. In case of perfect fit, the learning procedure is stopped early.
efficiency_steps – integer, optional (default=20), how many uBoostBDTs should be trained (each with its own target_efficiency)
base_estimator – object, optional (default=DecisionTreeClassifier(max_depth=2)) The base estimator from which the boosted ensemble is built. Support for sample weighting is required, as well as proper classes_ and n_classes_ attributes.
subsample – float (default=1.0), part of the training dataset used to train each base classifier.
smoothing – float (default=None), used to smooth the computation of local efficiencies; 0.0 corresponds to the usual uBoost
random_state – int, RandomState instance or None (default=None)
n_threads – int, number of threads used.
[1] J. Stevens, M. Williams, ‘uBoost: A boosting method for producing uniform selection efficiencies from multivariate classifiers’
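As a rough sketch of the ensemble idea (synthetic numbers, and a simple averaging rule assumed for illustration; hep_ml's actual combination may differ): each uBoostBDT answers whether an event passes its own target-efficiency cut, and averaging these answers over the efficiency grid yields a combined score in [0, 1].

```python
import numpy as np

rng = np.random.RandomState(7)
efficiency_steps = 20
scores = rng.normal(size=500)   # hypothetical per-event BDT scores

# A grid of target efficiencies, one per trained uBoostBDT (uniform grid assumed).
efficiencies = (np.arange(efficiency_steps) + 0.5) / efficiency_steps

# For each target efficiency, the corresponding global cut and pass/fail answer;
# averaging the answers over the grid gives a combined score in [0, 1].
cuts = np.percentile(scores, 100 * (1 - efficiencies))
passed = scores[:, np.newaxis] > cuts[np.newaxis, :]
combined_score = passed.mean(axis=1)
```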
- fit(X, y, sample_weight=None)[source]¶
Build a boosted classifier from the training set.
- Parameters
X – data, pandas.DataFrame of shape [n_samples, n_features]
y – labels, array of shape [n_samples] with 0 and 1. The target values (integers that correspond to classes).
sample_weight – array-like of shape [n_samples] with weights or None
- Returns
self
- predict(X)[source]¶
Predict labels
- Parameters
X – data, pandas.DataFrame of shape [n_samples, n_features]
- Returns
numpy.array of shape [n_samples]