Reweighting algorithms

hep_ml.reweight contains reweighting algorithms.

Reweighting is the procedure of finding weights for the original distribution that make the distributions of one or several variables identical in the original and target samples.

A typical application of this technique in HEP is reweighting of Monte Carlo simulation results to minimize the disagreement between simulated and real data. Frequently the reweighting rule is trained on one part of the data (normalization channel) and applied to a different part (signal channel).

Remark: even if each variable separately has identical distributions in the two samples, this does not imply that the multidimensional distributions are equal (almost surely they are not). The aim of reweighters is to obtain identical multidimensional distributions.
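The remark above can be demonstrated with a small numpy experiment (illustrative only, not part of hep_ml): two 2D samples built from the same standard-normal marginals, one with fully correlated features and one with independent features.

```python
import numpy as np

# Two 2D samples whose one-dimensional marginals coincide
# (each column is standard normal), while the joint
# distributions differ sharply.
rng = np.random.default_rng(0)
z = rng.normal(size=100_000)
w = rng.normal(size=100_000)

sample_a = np.column_stack([z, z])  # features fully correlated
sample_b = np.column_stack([z, w])  # features independent

corr_a = np.corrcoef(sample_a, rowvar=False)[0, 1]  # close to 1
corr_b = np.corrcoef(sample_b, rowvar=False)[0, 1]  # close to 0
print(corr_a, corr_b)
```

A reweighter matching only the two marginals would consider these samples identical; matching the joint distribution is the harder problem the reweighters address.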

Algorithms are implemented as estimators; the fitting and reweighting stages are split. A fitted reweighter can be applied many times to different data, pickled, and so on.

Folding over a reweighter is also available. This provides an easy way to run k-fold cross-validation, and it is also a convenient way to combine the weight predictions of the trained reweighters.

Examples

The most common use case is reweighting of Monte Carlo simulation results to sPlotted real data. (The original weights are all equal to 1 and could be skipped, but they are kept here for illustration.)

>>> from hep_ml.reweight import BinsReweighter, GBReweighter
>>> original_weights = numpy.ones(len(MC_data))
>>> reweighter = BinsReweighter(n_bins=100, n_neighs=3)
>>> reweighter.fit(original=MC_data, target=RealData,
...                original_weight=original_weights, target_weight=sWeights)
>>> MC_weights = reweighter.predict_weights(MC_data, original_weight=original_weights)

The same example for GBReweighter:

>>> reweighter = GBReweighter(max_depth=2, gb_args={'subsample': 0.5})
>>> reweighter.fit(original=MC_data, target=RealData, target_weight=sWeights)
>>> MC_weights = reweighter.predict_weights(MC_data)

Folding over reweighter:

>>> reweighter_base = GBReweighter(max_depth=2, gb_args={'subsample': 0.5})
>>> reweighter = FoldingReweighter(reweighter_base, n_folds=3)
>>> reweighter.fit(original=MC_data, target=RealData, target_weight=sWeights)

If the folding reweighter predicts the same data that were used during training, the weight predictions will be unbiased: each reweighter predicts only the part of the data that was not used during its training.

>>> MC_weights = reweighter.predict_weights(MC_data)
class hep_ml.reweight.BinsReweighter(n_bins=200, n_neighs=3.0)[source]

Bases: sklearn.base.BaseEstimator, hep_ml.reweight.ReweighterMixin

Use bins for reweighting. Bin edges are computed using quantiles along each axis (which is better than bins of equal width).

This method works well for 1d/2d histograms, but is unstable or inaccurate in higher dimensions.

To make the computed rule smoother and more stable, a Gaussian filter is applied after computing the weights in bins (so the reweighting coefficient also includes information from neighbouring bins).
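The idea can be sketched in one dimension as follows. This is a conceptual illustration, not the library's actual implementation: bin edges are taken from quantiles of the pooled data, both histograms are smoothed with a Gaussian filter, and the weight for each original event is the smoothed target/original ratio of its bin.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Synthetic 1D samples with a mild mismatch.
rng = np.random.default_rng(1)
original = rng.exponential(scale=1.0, size=50_000)
target = rng.exponential(scale=1.2, size=50_000)

n_bins, n_neighs = 100, 3.0
# Quantile-based edges over the pooled data, so every bin is populated.
edges = np.quantile(np.concatenate([original, target]),
                    np.linspace(0, 1, n_bins + 1))
hist_orig, _ = np.histogram(original, bins=edges)
hist_targ, _ = np.histogram(target, bins=edges)

# Smooth both histograms before taking the ratio; n_neighs (the filter
# width in bins) controls the stability/accuracy trade-off.
ratio = (gaussian_filter(hist_targ.astype(float), sigma=n_neighs)
         / gaussian_filter(hist_orig.astype(float), sigma=n_neighs))

# Each original event receives the ratio of its bin as a multiplier.
bin_idx = np.clip(np.searchsorted(edges, original, side='right') - 1,
                  0, n_bins - 1)
weights = ratio[bin_idx]
```

After reweighting, the weighted mean of the original sample moves toward the target mean, which is the effect the estimator automates in several dimensions at once.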

Parameters
  • n_bins (int) – how many bins to use for each input variable.

  • n_neighs (float) – size of the Gaussian filter (in bins). This parameter controls the trade-off between the stability of the rule and the accuracy of predictions: increasing n_neighs makes the reweighting rule more stable.

compute_bin_indices(data)[source]

Compute the bin index along each axis.

Parameters

data – data, array-like of shape [n_samples, n_features] with the same order of features as in training

Returns

numpy.array of shape [n_samples, n_features] with integers, each in [0, n_bins - 1]
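A per-axis bin-index computation of this kind can be sketched with plain numpy (illustrative, assuming quantile-based edges as described above; not the library's exact code):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=(1000, 2))
n_bins = 4

indices = np.empty(data.shape, dtype=int)
for axis in range(data.shape[1]):
    # Inner edges only: n_bins bins need n_bins - 1 internal cut points.
    inner = np.linspace(0, 1, n_bins + 1)[1:-1]
    edges = np.quantile(data[:, axis], inner)
    indices[:, axis] = np.searchsorted(edges, data[:, axis])

print(indices.min(), indices.max())  # indices lie in [0, n_bins - 1]
```

With quantile edges, each of the n_bins bins along an axis holds roughly the same number of events.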

fit(original, target, original_weight=None, target_weight=None)[source]

Prepare reweighting formula by computing histograms.

Parameters
  • original – values from original distribution, array-like of shape [n_samples, n_features]

  • target – values from target distribution, array-like of shape [n_samples, n_features]

  • original_weight – weights for samples of the original distribution

  • target_weight – weights for samples of the target distribution

Returns

self

predict_weights(original, original_weight=None)[source]

Returns corrected weights. Result is computed as original_weight * reweighter_multipliers.

Parameters
  • original – values from original distribution of shape [n_samples, n_features]

  • original_weight – weights of samples before reweighting.

Returns

numpy.array of shape [n_samples] with new weights.

class hep_ml.reweight.FoldingReweighter(base_reweighter, n_folds=2, random_state=None, verbose=True)[source]

Bases: sklearn.base.BaseEstimator, hep_ml.reweight.ReweighterMixin

This meta-regressor implements folding algorithm over reweighter:

  • training data is split into n equal parts;

  • n reweighters are trained, each one using n-1 folds.

To build unbiased predictions for the data, pass the same dataset (with the same order of events) as in training to predict_weights; then each event is predicted by a reweighter that did not use that event during training. To combine information from several reweighters during prediction instead, provide an appropriate voting function. Examples of voting functions:

>>> voting = lambda x: numpy.mean(x, axis=0)
>>> voting = lambda x: numpy.median(x, axis=0)

Parameters
  • base_reweighter (ReweighterMixin) – base reweighter object

  • n_folds – number of folds

  • random_state (None or int or RandomState) – random state for reproducibility

  • verbose (bool) –

fit(original, target, original_weight=None, target_weight=None)[source]

Prepare the reweighting formula by training the base reweighter on each combination of n-1 folds.

Parameters
  • original – values from original distribution, array-like of shape [n_samples, n_features]

  • target – values from target distribution, array-like of shape [n_samples, n_features]

  • original_weight – weights for samples of the original distribution

  • target_weight – weights for samples of the target distribution

Returns

self

predict_weights(original, original_weight=None, vote_function=None)[source]

Returns corrected weights. Result is computed as original_weight * reweighter_multipliers.

Parameters
  • original – values from original distribution of shape [n_samples, n_features]

  • original_weight – weights of samples before reweighting.

  • vote_function – to aggregate the predictions of all folds, pass a function, for instance lambda x: numpy.mean(x, axis=0), which averages the results over all folds. Another useful option is lambda x: numpy.median(x, axis=0). If not provided, each event is predicted only by the reweighter that did not use it during training.

Returns

numpy.array of shape [n_samples] with new weights.

class hep_ml.reweight.GBReweighter(n_estimators=40, learning_rate=0.2, max_depth=3, min_samples_leaf=200, loss_regularization=5.0, gb_args=None)[source]

Bases: sklearn.base.BaseEstimator, hep_ml.reweight.ReweighterMixin

Gradient Boosted Reweighter – a reweighting algorithm based on an ensemble of regression trees. The parameters play the same role as in gradient boosting. A special loss function is used: trees are trained to maximize the symmetrized binned chi-squared statistic.

Training takes much more time than for the bin-based versions, but GBReweighter is capable of working in high dimensions while keeping the reweighting rule reliable and precise (and even smooth if many trees are used).

Parameters
  • n_estimators – number of trees

  • learning_rate – float in [0, 1]. A smaller learning rate requires more trees, but makes the reweighting rule more stable.

  • max_depth – maximal depth of trees

  • min_samples_leaf – minimal number of events in the leaf.

  • loss_regularization – float, approximately equal to the number of events that the algorithm ‘puts’ in each leaf to prevent the weights from exploding.

  • gb_args – other parameters passed to gradient boosting: subsample, min_samples_split, max_features, max_leaf_nodes. For example: gb_args = {'subsample': 0.8, 'max_features': 0.75}. See hep_ml.gradientboosting.UGradientBoostingClassifier.
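The mechanism behind loss_regularization can be illustrated with a hypothetical numeric sketch (the actual GBReweighter loss is more involved than this): adding roughly loss_regularization "virtual" events to each leaf keeps the per-leaf multiplier bounded even when one of the samples leaves a leaf almost empty.

```python
import numpy as np

# Hypothetical summed weights per leaf: the second leaf has no original
# events, the third has no target events.
w_target = np.array([50.0, 3.0, 0.0])
w_original = np.array([40.0, 0.0, 2.0])
reg = 5.0

with np.errstate(divide='ignore'):
    naive = w_target / w_original        # blows up to inf on the empty leaf
regularized = (w_target + reg) / (w_original + reg)
print(naive)        # contains inf for the leaf with no original events
print(regularized)  # finite and moderate for every leaf
```

Larger values of loss_regularization pull all leaf multipliers toward 1, trading accuracy for stability, in the same spirit as n_neighs for BinsReweighter.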

fit(original, target, original_weight=None, target_weight=None)[source]

Prepare the reweighting formula by training a sequence of trees.

Parameters
  • original – values from original distribution, array-like of shape [n_samples, n_features]

  • target – values from target distribution, array-like of shape [n_samples, n_features]

  • original_weight – weights for samples of the original distribution

  • target_weight – weights for samples of the target distribution

Returns

self

predict_weights(original, original_weight=None)[source]

Returns corrected weights. Result is computed as original_weight * reweighter_multipliers.

Parameters
  • original – values from original distribution of shape [n_samples, n_features]

  • original_weight – weights of samples before reweighting.

Returns

numpy.array of shape [n_samples] with new weights.