uBoost

The module contains an implementation of the uBoost algorithm. The main goal of uBoost is to reduce the correlation between predictions and selected variables (e.g. the mass of a particle).

  • uBoostBDT is a modified version of AdaBoost that aims for uniform efficiency at a specified level of global efficiency

  • uBoostClassifier is a combination of uBoostBDTs for different efficiencies

This implementation is more advanced than the one described in the original paper: it adds smoothing, trains classifiers in threads, exposes learning_rate and uniforming_rate parameters, renormalizes weights automatically, and supports the SAMME.R modification, which uses predicted probabilities.

Only binary classification is implemented.

See also hep_ml.losses.BinFlatnessLossFunction, hep_ml.losses.KnnFlatnessLossFunction and hep_ml.losses.KnnAdaLossFunction, which also fight correlation.

Examples

To get predictions that are uniform in mass for background:

>>> from sklearn.tree import DecisionTreeClassifier
>>> from hep_ml.uboost import uBoostClassifier
>>> base_tree = DecisionTreeClassifier(max_depth=3)
>>> clf = uBoostClassifier(uniform_features=['mass'], uniform_label=0, base_estimator=base_tree,
...                        train_features=['pt', 'flight_time'])
>>> clf.fit(train_data, train_labels, sample_weight=train_weights)
>>> proba = clf.predict_proba(test_data)

To get predictions that are uniform in Dalitz variables for signal:

>>> clf = uBoostClassifier(uniform_features=['mass_12', 'mass_23'], uniform_label=1, base_estimator=base_tree,
...                        train_features=['pt', 'flight_time'])
>>> clf.fit(train_data, train_labels, sample_weight=train_weights)
>>> proba = clf.predict_proba(test_data)

class hep_ml.uboost.uBoostBDT(uniform_features, uniform_label, target_efficiency=0.5, n_neighbors=50, subsample=1.0, base_estimator=None, n_estimators=50, learning_rate=1.0, uniforming_rate=1.0, train_features=None, smoothing=0.0, random_state=None, algorithm='SAMME')

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

uBoostBDT is an AdaBoostClassifier modified to have flat efficiency along some variables for the class given by uniform_label (e.g. signal, class=1). Flatness is only guaranteed at the cut that corresponds to global efficiency == target_efficiency.
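
To illustrate what "the cut corresponding to global efficiency" means, here is a minimal sketch. It is not the library's code: the helper cut_for_efficiency is hypothetical. It picks the threshold on the decision function so that the requested weighted fraction of the events of the chosen class scores above it.

```python
import numpy as np


def cut_for_efficiency(scores, target_efficiency, sample_weight=None):
    """Threshold above which roughly `target_efficiency` of the weighted events lie.

    Hypothetical helper for illustration; not part of hep_ml.
    """
    scores = np.asarray(scores, dtype=float)
    if sample_weight is None:
        sample_weight = np.ones_like(scores)
    sample_weight = np.asarray(sample_weight, dtype=float)

    order = np.argsort(scores)
    # Weighted fraction of events with a score at or below each sorted score.
    cumulative = np.cumsum(sample_weight[order]) / sample_weight.sum()
    # First sorted position where that fraction reaches 1 - target_efficiency.
    position = np.searchsorted(cumulative, 1.0 - target_efficiency)
    position = min(position, len(scores) - 1)
    return scores[order][position]


# Scores of the events of the class that should be uniform (e.g. signal).
signal_scores = np.array([3.0, 0.0, 7.0, 1.0, 5.0, 2.0, 6.0, 4.0])
cut = cut_for_efficiency(signal_scores, target_efficiency=0.25)
global_efficiency = np.mean(signal_scores > cut)
```

With this cut, a quarter of the example events pass the selection. uBoostBDT aims to make that passing fraction the same in every region of the uniform features, but only at this one cut.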

Can be used alone, without uBoostClassifier.

Parameters
  • uniform_features – list of strings, names of variables, along which flatness is desired

  • uniform_label – int, label of class on which uniformity is desired (typically 0 for background, 1 for signal).

  • target_efficiency – float (default=0.5), the global efficiency; flatness is obtained at the global BDT cut that corresponds to it

  • n_neighbors – int (default=50), the number of neighbours used to compute local efficiency

  • subsample – float (default=1.0), part of training dataset used to build each base estimator.

  • base_estimator – classifier, optional (default=DecisionTreeClassifier(max_depth=2)) The base estimator from which the boosted ensemble is built. Support for sample weighting is required, as well as proper classes_ and n_classes_ attributes.

  • n_estimators – integer, optional (default=50) number of estimators used.

  • learning_rate – float, optional (default=1.) Learning rate shrinks the contribution of each classifier by learning_rate. There is a trade-off between learning_rate and n_estimators.

  • uniforming_rate – float, optional (default=1.) how strongly the uniformity of signal is taken into account. There is a trade-off between uniforming_rate and the speed of uniforming; a value of zero corresponds to plain AdaBoost

  • train_features – list of strings, names of variables used in fit/predict. If None, all the variables are used (including uniform_features)

  • smoothing – float (default=0.), smooths the computation of local efficiencies; 0.0 corresponds to the usual uBoost

  • random_state – int, RandomState instance or None (default None)
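
The roles of n_neighbors, target_efficiency and uniforming_rate can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's implementation. Both helpers are hypothetical, and the exponential form of the boost factor is an assumption chosen so that uniforming_rate=0 leaves the weights untouched, which corresponds to plain AdaBoost.

```python
import numpy as np


def local_efficiencies(scores, cut, neighbours_matrix):
    """Fraction of each event's neighbours whose score exceeds the cut."""
    passed = np.asarray(scores, dtype=float) > cut      # shape [n_samples]
    return passed[np.asarray(neighbours_matrix)].mean(axis=1)


def uniformity_boost(local_eff, target_efficiency, is_uniform_class, uniforming_rate=1.0):
    """Assumed multiplicative weight factor, not necessarily the library's formula.

    Events of the uniform class whose neighbourhood is selected less often
    than the target get a factor above 1; other classes are left unchanged.
    """
    exponent = uniforming_rate * (target_efficiency - local_eff) * is_uniform_class
    return np.exp(exponent)


scores = np.array([0.9, 0.8, 0.2, 0.1])
neighbours = np.array([[0, 1], [0, 1], [2, 3], [1, 2]])   # n_neighbors = 2
local_eff = local_efficiencies(scores, cut=0.5, neighbours_matrix=neighbours)
factors = uniformity_boost(local_eff, 0.5, is_uniform_class=np.array([1, 1, 1, 0]))
```

In this toy input the third event sits in a region where no neighbour passes the cut, so its weight grows. The first two events sit in an over-selected region, so their weights shrink.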

[1] J. Stevens, M. Williams, ‘uBoost: A boosting method for producing uniform selection efficiencies from multivariate classifiers’

decision_function(X)

Decision function. Returns a float for each sample; the greater the value, the more signal-like the event.

Parameters

X – data, pandas.DataFrame of shape [n_samples, n_features]

Returns

array of shape [n_samples] with floats

property feature_importances_

Return the feature importances for train_features.

Returns

array of shape [n_features], the order is the same as in train_features
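
Because the importances come back in the same order as train_features, pairing the two is enough to rank the features. A small sketch; rank_features is a hypothetical helper and the numbers are made up for illustration:

```python
def rank_features(train_features, importances):
    """Pair feature names with importances, most important first."""
    pairs = zip(train_features, importances)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Illustrative values only; in practice pass clf.feature_importances_.
ranking = rank_features(['pt', 'flight_time'], [0.3, 0.7])
```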

fit(X, y, sample_weight=None, neighbours_matrix=None)

Build a boosted classifier from the training set (X, y).

Parameters
  • X – array-like of shape [n_samples, n_features]

  • y – labels, array of shape [n_samples] with 0 and 1.

  • sample_weight – array-like of shape [n_samples] or None

  • neighbours_matrix – array-like of shape [n_samples, n_neighbours]; each row contains the indices of that event's signal neighbours (neighbours must be computed for background events too). If None, the matrix is computed.

Returns

self
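
As a rough illustration of the layout expected for neighbours_matrix, the sketch below builds such a matrix by brute force. It is an assumption-laden example rather than the routine the library uses: signal_neighbours_matrix is a hypothetical name, distances are plain Euclidean distances in the uniform features, and an event of the chosen class counts as its own nearest neighbour.

```python
import numpy as np


def signal_neighbours_matrix(uniform_values, labels, n_neighbours, uniform_label=1):
    """For every event, indices of its nearest events of class `uniform_label`.

    Returns an integer array of shape [n_samples, n_neighbours].
    """
    uniform_values = np.asarray(uniform_values, dtype=float)
    if uniform_values.ndim == 1:
        uniform_values = uniform_values[:, np.newaxis]
    class_indices = np.flatnonzero(np.asarray(labels) == uniform_label)
    class_points = uniform_values[class_indices]
    # Squared Euclidean distances, shape [n_samples, n_class_events].
    differences = uniform_values[:, np.newaxis, :] - class_points[np.newaxis, :, :]
    squared_distances = (differences ** 2).sum(axis=2)
    nearest = np.argsort(squared_distances, axis=1)[:, :n_neighbours]
    # Translate positions among class events back to indices in the full sample.
    return class_indices[nearest]


mass = [1.0, 1.1, 5.0, 5.2, 1.05, 5.1]
labels = [1, 1, 1, 1, 0, 0]
matrix = signal_neighbours_matrix(mass, labels, n_neighbours=2)
```

The last two events are background, yet each still receives a row of signal indices, as the parameter description requires. The brute-force distance table needs memory proportional to n_samples times the number of signal events, so it only suits small samples.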

predict(X)

Predict the class of each sample.

Parameters

X – data, pandas.DataFrame of shape [n_samples, n_features]

Returns

array of shape [n_samples] with predicted classes.

predict_proba(X)

Predict probabilities

Parameters

X – data, pandas.DataFrame of shape [n_samples, n_features]

Returns

array of shape [n_samples, n_classes] with probabilities.

staged_decision_function(X)

Decision function after each stage of boosting. Gives a float for each sample; the greater the value, the more signal-like the event.

Parameters

X – data, pandas.DataFrame of shape [n_samples, n_features]

Returns

sequence of arrays of shape [n_samples] with floats, one per boosting stage.

staged_predict_proba(X)

Predicted probabilities for each sample after each stage of boosting.

Parameters

X – data, pandas.DataFrame of shape [n_samples, n_features]

Returns

sequence of numpy.arrays of shape [n_samples, n_classes]
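
Staged predictions are typically used to see how quality changes with the number of boosting stages. The sketch below is one way to consume such a sequence. The helper best_stage is hypothetical, accuracy is used only as a simple stand-in metric, and column 1 is assumed to hold the probability of class 1.

```python
def best_stage(staged_probas, labels, threshold=0.5):
    """Return (stage_index, accuracy) for the stage with the highest accuracy."""
    best = (-1, -1.0)
    for stage, proba in enumerate(staged_probas):
        # Column 1 is assumed to be the probability of class 1.
        predicted = [int(row[1] > threshold) for row in proba]
        accuracy = sum(p == t for p, t in zip(predicted, labels)) / len(labels)
        if accuracy > best[1]:
            best = (stage, accuracy)
    return best


# Two fake stages for two samples, standing in for real staged output.
fake_stages = [
    [[0.6, 0.4], [0.6, 0.4]],
    [[0.3, 0.7], [0.8, 0.2]],
]
stage, accuracy = best_stage(fake_stages, labels=[1, 0])
```

With a fitted classifier the call would look like best_stage(clf.staged_predict_proba(test_data), test_labels), where test_labels is assumed to hold the true classes.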

class hep_ml.uboost.uBoostClassifier(uniform_features, uniform_label, train_features=None, n_neighbors=50, efficiency_steps=20, n_estimators=40, base_estimator=None, subsample=1.0, algorithm='SAMME', smoothing=None, n_threads=1, random_state=None)

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

uBoost classifier, a boosting algorithm that aims for flat efficiency in signal along some variables (e.g. mass).

In principle, uBoost is an ensemble of uBoostBDTs. See [1] for details.

Parameters
  • uniform_features – list of strings, names of variables, along which flatness is desired

  • uniform_label – int, the label of the class for which uniformity is desired

  • train_features – list of strings, names of variables used in fit/predict. If None, all the variables are used (including uniform_features)

  • n_neighbors – int (default=50), the number of neighbours used to compute local efficiency

  • n_estimators – integer, optional (default=40) The maximum number of estimators at which boosting is terminated. In case of perfect fit, the learning procedure is stopped early.

  • efficiency_steps – integer, optional (default=20), how many uBoostBDTs should be trained (each with its own target_efficiency)

  • base_estimator – object, optional (default=DecisionTreeClassifier(max_depth=2)) The base estimator from which the boosted ensemble is built. Support for sample weighting is required, as well as proper classes_ and n_classes_ attributes.

  • subsample – float (default=1.), part of training dataset used to train each base classifier.

  • smoothing – float (default=None), used to smooth computing of local efficiencies; 0.0 corresponds to the usual uBoost

  • random_state – int, RandomState instance or None, (default=None)

  • n_threads – int, number of threads used.
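
How efficiency_steps turns into a set of uBoostBDTs, and how their outputs might be merged, can be sketched as below. Both functions are hypothetical. The evenly spaced grid of target efficiencies is an assumption, and the merged response (the fraction of BDTs whose score passes their own cut) follows the general idea described in [1] rather than the exact code of this class.

```python
def target_efficiencies(efficiency_steps):
    """Evenly spaced efficiencies strictly between 0 and 1 (assumed spacing)."""
    return [(i + 1.0) / (efficiency_steps + 1.0) for i in range(efficiency_steps)]


def combined_response(scores_per_bdt, cuts):
    """Fraction of BDTs whose score for one event exceeds that BDT's own cut."""
    passed = sum(score > cut for score, cut in zip(scores_per_bdt, cuts))
    return passed / len(cuts)


efficiencies = target_efficiencies(3)
# One event scored by three BDTs, each with its own global cut.
response = combined_response([0.2, 0.6, 0.9], cuts=[0.5, 0.5, 0.5])
```

Since each BDT is trained to be uniform only at its own cut, using many target efficiencies is what yields approximate uniformity over the whole range of the merged response.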

[1] J. Stevens, M. Williams, ‘uBoost: A boosting method for producing uniform selection efficiencies from multivariate classifiers’

fit(X, y, sample_weight=None)

Build a boosted classifier from the training set.

Parameters
  • X – data, pandas.DataFrame of shape [n_samples, n_features]

  • y – labels, array of shape [n_samples] with 0 and 1 (integers that correspond to classes).

  • sample_weight – array-like of shape [n_samples] with weights or None

Returns

self

predict(X)

Predict labels

Parameters

X – data, pandas.DataFrame of shape [n_samples, n_features]

Returns

numpy.array of shape [n_samples]

predict_proba(X)

Predict probabilities

Parameters

X – data, pandas.DataFrame of shape [n_samples, n_features]

Returns

array of shape [n_samples, n_classes] with probabilities.

staged_predict_proba(X)

Predicted probabilities for each sample after each stage of boosting.

Parameters

X – data, pandas.DataFrame of shape [n_samples, n_features]

Returns

sequence of numpy.arrays of shape [n_samples, n_classes]