Reweighting algorithms

hep_ml.reweight contains reweighting algorithms.

Reweighting is the procedure of finding weights for the original distribution that make the distributions of one or several variables identical in the original and target samples.

A typical application of this technique in HEP is reweighting of Monte Carlo simulation results to minimize the disagreement between simulated and real data. Frequently the reweighting rule is trained on one part of the data (normalization channel) and applied to a different part (signal channel).

Remark: even if each variable separately has identical distributions in the two samples, this does not imply that the multidimensional distributions are equal (almost surely they are not). The aim of reweighters is to obtain identical multidimensional distributions.
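The remark above can be demonstrated with a small numpy experiment (illustrative only, not part of hep_ml): two 2D samples built from the same standard-normal marginals, one with fully correlated features and one with independent features.

```python
import numpy as np

# Two 2D samples whose one-dimensional marginals coincide
# (each column is standard normal), while the joint
# distributions differ sharply.
rng = np.random.default_rng(0)
z = rng.normal(size=100_000)
w = rng.normal(size=100_000)

sample_a = np.column_stack([z, z])  # features fully correlated
sample_b = np.column_stack([z, w])  # features independent

corr_a = np.corrcoef(sample_a, rowvar=False)[0, 1]  # close to 1
corr_b = np.corrcoef(sample_b, rowvar=False)[0, 1]  # close to 0
print(corr_a, corr_b)
```

A reweighter matching only the two marginals would consider these samples identical; matching the joint distribution is the harder problem the reweighters address.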

Algorithms are implemented as estimators; the fitting and reweighting stages are split. A fitted reweighter can be applied many times to different data, pickled, and so on.

Folding over a reweighter is also available. This provides an easy way to run k-fold cross-validation, and it is also a convenient way to combine the weight predictions of the trained reweighters.

Examples

The most common use case is reweighting of Monte Carlo simulation results to sPlotted real data. (The original weights are all equal to 1 and could be skipped, but they are kept here for illustration.)

>>> from hep_ml.reweight import BinsReweighter, GBReweighter
>>> original_weights = numpy.ones(len(MC_data))
>>> reweighter = BinsReweighter(n_bins=100, n_neighs=3)
>>> reweighter.fit(original=MC_data, target=RealData,
...                original_weight=original_weights, target_weight=sWeights)
>>> MC_weights = reweighter.predict_weights(MC_data, original_weight=original_weights)

The same example for GBReweighter:

>>> reweighter = GBReweighter(max_depth=2, gb_args={'subsample': 0.5})
>>> reweighter.fit(original=MC_data, target=RealData, target_weight=sWeights)
>>> MC_weights = reweighter.predict_weights(MC_data)

Folding over reweighter:

>>> reweighter_base = GBReweighter(max_depth=2, gb_args={'subsample': 0.5})
>>> reweighter = FoldingReweighter(reweighter_base, n_folds=3)
>>> reweighter.fit(original=MC_data, target=RealData, target_weight=sWeights)

If the folding reweighter predicts the same data that were used during training, the weight predictions will be unbiased: each reweighter predicts only the part of the data that was not used during its training.

>>> MC_weights = reweighter.predict_weights(MC_data)
class hep_ml.reweight.BinsReweighter(n_bins=200, n_neighs=3.0)[source]

Bases: sklearn.base.BaseEstimator, hep_ml.reweight.ReweighterMixin

Use bins for reweighting. Bin edges are computed using quantiles along each axis (which is better than bins of equal width).

This method works well for 1d/2d histograms, but is unstable or inaccurate in higher dimensions.

To make the computed rule smoother and more stable, a Gaussian filter is applied after computing the weights in bins (so the reweighting coefficient also includes information from neighbouring bins).
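The idea can be sketched in one dimension as follows. This is a conceptual illustration, not the library's actual implementation: bin edges are taken from quantiles of the pooled data, both histograms are smoothed with a Gaussian filter, and the weight for each original event is the smoothed target/original ratio of its bin.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Synthetic 1D samples with a mild mismatch.
rng = np.random.default_rng(1)
original = rng.exponential(scale=1.0, size=50_000)
target = rng.exponential(scale=1.2, size=50_000)

n_bins, n_neighs = 100, 3.0
# Quantile-based edges over the pooled data, so every bin is populated.
edges = np.quantile(np.concatenate([original, target]),
                    np.linspace(0, 1, n_bins + 1))
hist_orig, _ = np.histogram(original, bins=edges)
hist_targ, _ = np.histogram(target, bins=edges)

# Smooth both histograms before taking the ratio; n_neighs (the filter
# width in bins) controls the stability/accuracy trade-off.
ratio = (gaussian_filter(hist_targ.astype(float), sigma=n_neighs)
         / gaussian_filter(hist_orig.astype(float), sigma=n_neighs))

# Each original event receives the ratio of its bin as a multiplier.
bin_idx = np.clip(np.searchsorted(edges, original, side='right') - 1,
                  0, n_bins - 1)
weights = ratio[bin_idx]
```

After reweighting, the weighted mean of the original sample moves toward the target mean, which is the effect the estimator automates in several dimensions at once.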

Parameters
  • n_bins (int) – how many bins to use for each input variable.

  • n_neighs (float) – size of the Gaussian filter (in bins). This parameter controls the trade-off between the stability of the rule and the accuracy of predictions: increasing n_neighs makes the reweighting rule more stable.

compute_bin_indices(data)[source]

Compute the bin index along each axis.

Parameters

data – data, array-like of shape [n_samples, n_features] with the same order of features as in training

Returns

numpy.array of shape [n_samples, n_features] with integers, each in [0, n_bins - 1]
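A per-axis bin-index computation of this kind can be sketched with plain numpy (illustrative, assuming quantile-based edges as described above; not the library's exact code):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=(1000, 2))
n_bins = 4

indices = np.empty(data.shape, dtype=int)
for axis in range(data.shape[1]):
    # Inner edges only: n_bins bins need n_bins - 1 internal cut points.
    inner = np.linspace(0, 1, n_bins + 1)[1:-1]
    edges = np.quantile(data[:, axis], inner)
    indices[:, axis] = np.searchsorted(edges, data[:, axis])

print(indices.min(), indices.max())  # indices lie in [0, n_bins - 1]
```

With quantile edges, each of the n_bins bins along an axis holds roughly the same number of events.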

fit(original, target, original_weight=None, target_weight=None)[source]

Prepare reweighting formula by computing histograms.

Parameters
  • original – values from original distribution, array-like of shape [n_samples, n_features]

  • target – values from target distribution, array-like of shape [n_samples, n_features]

  • original_weight – weights for samples of the original distribution

  • target_weight – weights for samples of the target distribution

Returns

self

predict_weights(original, original_weight=None)[source]

Returns corrected weights. Result is computed as original_weight * reweighter_multipliers.

Parameters
  • original – values from original distribution of shape [n_samples, n_features]

  • original_weight – weights of samples before reweighting.

Returns

numpy.array of shape [n_samples] with new weights.

class hep_ml.reweight.FoldingReweighter(base_reweighter, n_folds=2, random_state=None, verbose=True)[source]

Bases: sklearn.base.BaseEstimator, hep_ml.reweight.ReweighterMixin

This meta-regressor implements folding algorithm over reweighter:

  • training data is split into n equal parts;

  • n reweighters are trained, each one using n-1 folds.

To build unbiased predictions for the data, pass the same dataset (with the same order of events) as in training to predict_weights; then each event is predicted by a reweighter that did not use that event during training. To combine information from several reweighters during prediction instead, provide an appropriate voting function. Examples of voting functions:

>>> voting = lambda x: numpy.mean(x, axis=0)
>>> voting = lambda x: numpy.median(x, axis=0)

Parameters
  • base_reweighter (ReweighterMixin) – base reweighter object

  • n_folds – number of folds

  • random_state (None or int or RandomState) – random state for reproducibility

  • verbose (bool) –

fit(original, target, original_weight=None, target_weight=None)[source]

Prepare the reweighting formula by training the base reweighter on each combination of n-1 folds.

Parameters
  • original – values from original distribution, array-like of shape [n_samples, n_features]

  • target – values from target distribution, array-like of shape [n_samples, n_features]

  • original_weight – weights for samples of the original distribution

  • target_weight – weights for samples of the target distribution

Returns

self

predict_weights(original, original_weight=None, vote_function=None)[source]

Returns corrected weights. Result is computed as original_weight * reweighter_multipliers.

Parameters
  • original – values from original distribution of shape [n_samples, n_features]

  • original_weight – weights of samples before reweighting.

  • vote_function – to aggregate the predictions of all folds, pass a function, for instance lambda x: numpy.mean(x, axis=0), which averages the results over all folds. Another useful option is lambda x: numpy.median(x, axis=0). If not provided, each event is predicted only by the reweighter that did not use it during training.

Returns

numpy.array of shape [n_samples] with new weights.

class hep_ml.reweight.GBReweighter(n_estimators=40, learning_rate=0.2, max_depth=3, min_samples_leaf=200, loss_regularization=5.0, gb_args=None)[source]

Bases: sklearn.base.BaseEstimator, hep_ml.reweight.ReweighterMixin

Gradient Boosted Reweighter – a reweighting algorithm based on an ensemble of regression trees. The parameters play the same role as in gradient boosting. A special loss function is used: trees are trained to maximize the symmetrized binned chi-squared statistic.

Training takes much more time than for the bin-based versions, but GBReweighter is capable of working in high dimensions while keeping the reweighting rule reliable and precise (and even smooth if many trees are used).

Parameters
  • n_estimators – number of trees

  • learning_rate – float in [0, 1]. A smaller learning rate requires more trees, but makes the reweighting rule more stable.

  • max_depth – maximal depth of trees

  • min_samples_leaf – minimal number of events in the leaf.

  • loss_regularization – float, approximately equal to the number of events that the algorithm ‘puts’ in each leaf to prevent the weights from exploding.

  • gb_args – other parameters passed to gradient boosting: subsample, min_samples_split, max_features, max_leaf_nodes. For example: gb_args = {'subsample': 0.8, 'max_features': 0.75}. See hep_ml.gradientboosting.UGradientBoostingClassifier.
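The mechanism behind loss_regularization can be illustrated with a hypothetical numeric sketch (the actual GBReweighter loss is more involved than this): adding roughly loss_regularization "virtual" events to each leaf keeps the per-leaf multiplier bounded even when one of the samples leaves a leaf almost empty.

```python
import numpy as np

# Hypothetical summed weights per leaf: the second leaf has no original
# events, the third has no target events.
w_target = np.array([50.0, 3.0, 0.0])
w_original = np.array([40.0, 0.0, 2.0])
reg = 5.0

with np.errstate(divide='ignore'):
    naive = w_target / w_original        # blows up to inf on the empty leaf
regularized = (w_target + reg) / (w_original + reg)
print(naive)        # contains inf for the leaf with no original events
print(regularized)  # finite and moderate for every leaf
```

Larger values of loss_regularization pull all leaf multipliers toward 1, trading accuracy for stability, in the same spirit as n_neighs for BinsReweighter.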

fit(original, target, original_weight=None, target_weight=None)[source]

Prepare the reweighting formula by training a sequence of trees.

Parameters
  • original – values from original distribution, array-like of shape [n_samples, n_features]

  • target – values from target distribution, array-like of shape [n_samples, n_features]

  • original_weight – weights for samples of the original distribution

  • target_weight – weights for samples of the target distribution

Returns

self

predict_weights(original, original_weight=None)[source]

Returns corrected weights. Result is computed as original_weight * reweighter_multipliers.

Parameters
  • original – values from original distribution of shape [n_samples, n_features]

  • original_weight – weights of samples before reweighting.

Returns

numpy.array of shape [n_samples] with new weights.