uBoost

This module contains an implementation of the uBoost algorithm. The main goal of uBoost is to reduce the correlation between predictions and selected variables (e.g. the mass of a particle).

  • uBoostBDT is a modified version of AdaBoost that aims for uniform efficiency at a specified level of global efficiency

  • uBoostClassifier is a combination of uBoostBDTs for different efficiencies

This implementation is more advanced than the one described in the original paper: it supports smoothing, trains classifiers in threads, has learning_rate and uniforming_rate parameters, renormalizes weights automatically, and supports the SAMME.R modification, which uses predicted probabilities.

Only binary classification is implemented.

See also hep_ml.losses.BinFlatnessLossFunction, hep_ml.losses.KnnFlatnessLossFunction and hep_ml.losses.KnnAdaLossFunction, which are other tools for reducing correlation.

Examples

To get uniform prediction in mass for background:

>>> from sklearn.tree import DecisionTreeClassifier
>>> from hep_ml.uboost import uBoostClassifier
>>> base_tree = DecisionTreeClassifier(max_depth=3)
>>> clf = uBoostClassifier(uniform_features=['mass'], uniform_label=0, base_estimator=base_tree,
...                        train_features=['pt', 'flight_time'])
>>> clf.fit(train_data, train_labels, sample_weight=train_weights)
>>> proba = clf.predict_proba(test_data)

To get uniform prediction in Dalitz variables for signal:

>>> clf = uBoostClassifier(uniform_features=['mass_12', 'mass_23'], uniform_label=1, base_estimator=base_tree,
...                        train_features=['pt', 'flight_time'])
>>> clf.fit(train_data, train_labels, sample_weight=train_weights)
>>> proba = clf.predict_proba(test_data)

class hep_ml.uboost.uBoostBDT(uniform_features, uniform_label, target_efficiency=0.5, n_neighbors=50, subsample=1.0, base_estimator=None, n_estimators=50, learning_rate=1.0, uniforming_rate=1.0, train_features=None, smoothing=0.0, random_state=None, algorithm='SAMME')

Bases: BaseEstimator, ClassifierMixin

uBoostBDT is an AdaBoostClassifier modified to have flat efficiency of signal (class=1) along some variables. Uniformity is only guaranteed at the cut that corresponds to global efficiency == target_efficiency.

Can be used alone, without uBoostClassifier.
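
The notion of "the cut corresponding to a global efficiency" can be illustrated with a small sketch. It is not taken from the hep_ml sources: cut_for_efficiency is a hypothetical helper, written in plain Python, that finds a threshold such that the weighted fraction of events scoring above it does not exceed the target.

```python
def cut_for_efficiency(scores, weights, target_efficiency):
    """Hypothetical helper: return a threshold such that the weighted
    fraction of events with score > threshold is at most target_efficiency."""
    total = sum(weights)
    # Walk from the highest score downwards, accumulating weight.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    passed = 0.0
    for i in order:
        # Stop at the first event that would push us over the target.
        if passed + weights[i] > target_efficiency * total + 1e-12:
            return scores[i]
        passed += weights[i]
    return min(scores) - 1.0  # every event passes


signal_scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.2, 0.55]
signal_weights = [1.0] * len(signal_scores)

cut = cut_for_efficiency(signal_scores, signal_weights, target_efficiency=0.5)
efficiency = sum(
    w for s, w in zip(signal_scores, signal_weights) if s > cut
) / sum(signal_weights)
```

Uniformity at this cut means that the same fraction of events passes it in every region of the uniform features, not only globally.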

Parameters:
  • uniform_features – list of strings, names of variables, along which flatness is desired

  • uniform_label – int, label of class on which uniformity is desired (typically 0 for background, 1 for signal).

  • target_efficiency – float (default=0.5), the flatness is obtained at the global BDT cut that corresponds to this global efficiency

  • n_neighbors – int, (default=50) the number of neighbours, which are used to compute local efficiency

  • subsample – float (default=1.0), part of training dataset used to build each base estimator.

  • base_estimator – classifier, optional (default=DecisionTreeClassifier(max_depth=2)) The base estimator from which the boosted ensemble is built. Support for sample weighting is required, as well as proper classes_ and n_classes_ attributes.

  • n_estimators – integer, optional (default=50) number of estimators used.

  • learning_rate – float, optional (default=1.) Learning rate shrinks the contribution of each classifier by learning_rate. There is a trade-off between learning_rate and n_estimators.

  • uniforming_rate – float, optional (default=1.) How strongly the uniformity of signal is taken into account. There is a trade-off between uniforming_rate and the speed of uniforming; a value of zero corresponds to plain AdaBoost.

  • train_features – list of strings, names of variables used in fit/predict. If None, all the variables are used (including uniform_features)

  • smoothing – float, (default=0.), used to smooth computing of local efficiencies, 0.0 corresponds to usual uBoost

  • random_state – int, RandomState instance or None (default None)
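
To make n_neighbors and uniforming_rate more concrete, here is a rough sketch under simplifying assumptions. The local efficiency of an event is taken as the fraction of its nearest neighbours (in one uniform feature) that pass the global cut, and the weight multiplier exp(uniforming_rate * (target - local)) is an assumed, simplified form of the uniforming step; the library's actual update may differ. Both function names are hypothetical.

```python
import math


def knn_local_efficiencies(positions, passed, n_neighbors):
    """For each event, the fraction of its nearest events (in a single
    uniform feature, the event itself included) that pass the global cut."""
    efficiencies = []
    for x in positions:
        nearest = sorted(
            range(len(positions)), key=lambda j: abs(positions[j] - x)
        )[:n_neighbors]
        efficiencies.append(sum(passed[j] for j in nearest) / len(nearest))
    return efficiencies


def uniforming_multipliers(local_eff, target_efficiency, uniforming_rate):
    """Assumed, simplified update: raise the weight of events in regions
    whose local efficiency is below the target, lower it elsewhere."""
    return [
        math.exp(uniforming_rate * (target_efficiency - e)) for e in local_eff
    ]


# Two well-separated mass regions: one over-efficient, one under-efficient.
mass = [1.0, 1.1, 1.2, 1.3, 5.0, 5.1, 5.2, 5.3]
passed = [1, 1, 1, 0, 0, 0, 0, 1]

local_eff = knn_local_efficiencies(mass, passed, n_neighbors=4)
boosted = uniforming_multipliers(local_eff, target_efficiency=0.5, uniforming_rate=1.0)
plain = uniforming_multipliers(local_eff, target_efficiency=0.5, uniforming_rate=0.0)
```

With uniforming_rate=0.0 every multiplier equals 1, which is consistent with the statement that a zero value corresponds to plain AdaBoost.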

Reference

decision_function(X)

Decision function. Returns a float for each sample; the greater the value, the more signal-like the event is.

Parameters:

X – data, pandas.DataFrame of shape [n_samples, n_features]

Returns:

array of shape [n_samples] with floats
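
As an illustration of how a boosted decision value and learning_rate interact, the sketch below uses the textbook discrete AdaBoost form, where each base estimator votes +1 or -1 and its weight is learning_rate * log((1 - err) / err). This is an assumption for illustration only; the internals of uBoostBDT may differ, and the function names are hypothetical.

```python
import math


def estimator_weight(weighted_error, learning_rate):
    """Textbook discrete AdaBoost weight, shrunk by learning_rate."""
    return learning_rate * math.log((1.0 - weighted_error) / weighted_error)


def decision_score(votes, weighted_errors, learning_rate):
    """votes: +1 (signal-like) or -1 (background-like), one per base estimator."""
    return sum(
        estimator_weight(err, learning_rate) * vote
        for vote, err in zip(votes, weighted_errors)
    )


votes = [+1, +1, -1]
errors = [0.2, 0.3, 0.4]

full = decision_score(votes, errors, learning_rate=1.0)
shrunk = decision_score(votes, errors, learning_rate=0.5)
```

In this simplified picture, halving learning_rate halves the contribution of every estimator, which is why a smaller learning_rate usually calls for a larger n_estimators.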

property feature_importances_

Return the feature importances for train_features.

Returns:

array of shape [n_features], the order is the same as in train_features

fit(X, y, sample_weight=None, neighbours_matrix=None)

Build a boosted classifier from the training set (X, y).

Parameters:
  • X – array-like of shape [n_samples, n_features]

  • y – labels, array of shape [n_samples] with 0 and 1.

  • sample_weight – array-like of shape [n_samples] or None

  • neighbours_matrix – array-like of shape [n_samples, n_neighbours]. Each row contains the indices of the signal neighbours of the corresponding sample (neighbours should be computed for background samples too). If None, this matrix is computed automatically.

Returns:

self
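
A precomputed neighbours_matrix could be built along the following lines. This is a minimal plain-Python sketch for a single uniform feature; it assumes that the indices refer to rows of the full training set, and build_neighbours_matrix is a hypothetical helper rather than part of the library.

```python
def build_neighbours_matrix(uniform_values, labels, uniform_label, n_neighbours):
    """For every sample (of either class), list the indices of the
    n_neighbours closest samples that belong to uniform_label."""
    class_indices = [i for i, label in enumerate(labels) if label == uniform_label]
    matrix = []
    for x in uniform_values:
        nearest = sorted(class_indices, key=lambda j: abs(uniform_values[j] - x))
        matrix.append(nearest[:n_neighbours])
    return matrix


mass = [1.0, 1.1, 1.4, 3.0, 3.1, 3.5]
labels = [1, 0, 1, 1, 0, 1]

neighbours = build_neighbours_matrix(mass, labels, uniform_label=1, n_neighbours=2)
```

Rows 1 and 4 belong to background samples, yet they still list signal neighbours only, as the parameter description requires.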

predict(X)

Predict classes for each sample.

Parameters:

X – data, pandas.DataFrame of shape [n_samples, n_features]

Returns:

array of shape [n_samples] with predicted classes.

predict_proba(X)

Predict probabilities.

Parameters:

X – data, pandas.DataFrame of shape [n_samples, n_features]

Returns:

array of shape [n_samples, n_classes] with probabilities.

set_fit_request(*, neighbours_matrix: bool | None | str = '$UNCHANGED$', sample_weight: bool | None | str = '$UNCHANGED$') -> uBoostBDT

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

neighbours_matrix : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for neighbours_matrix parameter in fit.

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in fit.

Returns

self : object

The updated object.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') -> uBoostBDT

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in score.

Returns

self : object

The updated object.

staged_decision_function(X)

Decision function after each stage of boosting. Returns a float for each sample; the greater the value, the more signal-like the event is.

Parameters:

X – data, pandas.DataFrame of shape [n_samples, n_features]

Returns:

array of shape [n_samples] with floats.

staged_predict_proba(X)

Predicted probabilities for each sample after each stage of boosting.

Parameters:

X – data, pandas.DataFrame of shape [n_samples, n_features]

Returns:

sequence of numpy.arrays of shape [n_samples, n_classes]
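
Staged outputs are typically consumed by iterating over them and computing a quality metric per stage, for example to choose the number of boosting stages. The sketch below shows that pattern only. fake_staged_predict_proba is a hypothetical stand-in for clf.staged_predict_proba(X), and the numbers in it are made up.

```python
def fake_staged_predict_proba():
    """Hypothetical stand-in for clf.staged_predict_proba(X): yields one
    [n_samples, n_classes] table of probabilities per boosting stage."""
    yield [[0.6, 0.4], [0.5, 0.5], [0.4, 0.6]]
    yield [[0.7, 0.3], [0.6, 0.4], [0.3, 0.7]]
    yield [[0.8, 0.2], [0.3, 0.7], [0.2, 0.8]]


def accuracy(proba, labels):
    """Fraction of samples whose more probable class matches the label."""
    predicted = [1 if row[1] > row[0] else 0 for row in proba]
    return sum(p == y for p, y in zip(predicted, labels)) / len(labels)


labels = [0, 1, 1]

# One metric value per boosting stage, in the order the stages are yielded.
stage_scores = [accuracy(proba, labels) for proba in fake_staged_predict_proba()]
best_stage = max(range(len(stage_scores)), key=stage_scores.__getitem__) + 1
```

In practice the metric should be evaluated on held-out data, and a uniformity metric can be tracked alongside it in the same loop.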

class hep_ml.uboost.uBoostClassifier(uniform_features, uniform_label, train_features=None, n_neighbors=50, efficiency_steps=20, n_estimators=40, base_estimator=None, subsample=1.0, algorithm='SAMME', smoothing=None, n_threads=1, random_state=None)

Bases: BaseEstimator, ClassifierMixin

uBoost classifier, an algorithm of boosting targeted to obtain flat efficiency in signal along some variables (e.g. mass).

In principle, uBoost is an ensemble of uBoostBDTs. See [1] for details.

Parameters:
  • uniform_features – list of strings, names of variables along which flatness is desired

  • uniform_label – int, the label of the class for which uniformity is desired

  • train_features – list of strings, names of variables used in fit/predict. If None, all the variables are used (including uniform_features)

  • n_neighbors – int, (default=50) the number of neighbours used to compute local efficiency

  • n_estimators – integer, optional (default=40) The maximum number of estimators at which boosting is terminated. In case of perfect fit, the learning procedure is stopped early.

  • efficiency_steps – integer, optional (default=20) How many uBoostBDTs should be trained (each with its own target_efficiency)

  • base_estimator – object, optional (default=DecisionTreeClassifier(max_depth=2)) The base estimator from which the boosted ensemble is built. Support for sample weighting is required, as well as proper classes_ and n_classes_ attributes.

  • subsample – float (default=1.), part of training dataset used to train each base classifier.

  • smoothing – float, (default=None), used to smooth computing of local efficiencies, 0.0 corresponds to usual uBoost

  • random_state – int, RandomState instance or None, (default=None)

  • n_threads – int, (default=1) number of threads used.
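
The role of efficiency_steps can be sketched as follows. Two things here are assumptions made for illustration: that the target efficiencies are evenly spaced strictly between 0 and 1, and that the combined response of an event is the fraction of BDTs whose own cut it passes. Both helper names are hypothetical, and the library may additionally smooth the per-BDT decisions.

```python
def target_efficiencies(efficiency_steps):
    """Assumed spacing: evenly spaced targets strictly between 0 and 1,
    one per uBoostBDT in the ensemble."""
    return [(i + 1) / (efficiency_steps + 1) for i in range(efficiency_steps)]


def combined_response(bdt_scores, bdt_cuts):
    """Fraction of the BDTs whose own cut this event passes."""
    passed = sum(score > cut for score, cut in zip(bdt_scores, bdt_cuts))
    return passed / len(bdt_cuts)


targets = target_efficiencies(4)

# One event, scored by four BDTs that each have their own cut.
response = combined_response(
    bdt_scores=[0.2, 0.1, -0.3, -0.4],
    bdt_cuts=[0.5, 0.0, -0.2, -0.6],
)
```

Because each BDT is uniform only at its own cut, combining many of them is what allows the final response to stay approximately uniform over a range of working points.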

Reference

fit(X, y, sample_weight=None)

Build a boosted classifier from the training set.

Parameters:
  • X – data, pandas.DataFrame of shape [n_samples, n_features]

  • y – labels, array of shape [n_samples] with 0 and 1. The target values (integers that correspond to classes).

  • sample_weight – array-like of shape [n_samples] with weights or None

Returns:

self

predict(X)

Predict labels.

Parameters:

X – data, pandas.DataFrame of shape [n_samples, n_features]

Returns:

numpy.array of shape [n_samples]

predict_proba(X)

Predict probabilities.

Parameters:

X – data, pandas.DataFrame of shape [n_samples, n_features]

Returns:

array of shape [n_samples, n_classes] with probabilities.

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') -> uBoostClassifier

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in fit.

Returns

self : object

The updated object.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') -> uBoostClassifier

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in score.

Returns

self : object

The updated object.

staged_predict_proba(X)

Predicted probabilities for each sample after each stage of boosting.

Parameters:

X – data, pandas.DataFrame of shape [n_samples, n_features]

Returns:

sequence of numpy.arrays of shape [n_samples, n_classes]