Fast predictions

hep_ml.speedup is a module for obtaining machine-learning formulas that can be applied very fast (at a speed comparable to simple selections) while keeping high classification quality.

In many applications (e.g. triggers in HEP) a really fast formula is essential. This module contains tools to prepare formulas that can be applied at a speed comparable to cuts.

Example

Let’s show how one can use a really heavy classifier and still get fast predictions:

>>> from sklearn.ensemble import RandomForestClassifier
>>> from hep_ml.speedup import LookupClassifier
>>> # X, y and sample_weight are assumed to be prepared beforehand (features, labels, event weights)
>>> base_classifier = RandomForestClassifier(n_estimators=1000, max_depth=25)
>>> classifier = LookupClassifier(base_estimator=base_classifier, keep_trained_estimator=False)
>>> classifier.fit(X, y, sample_weight=sample_weight)

Although training takes a long time, all predictions are precomputed and saved to a lookup table, so you can predict millions of events per second on a single CPU:

>>> classifier.predict_proba(testX)

class hep_ml.speedup.LookupClassifier(base_estimator, n_bins=16, max_cells=500000000, keep_trained_estimator=True)

Bases: BaseEstimator, ClassifierMixin

LookupClassifier splits each feature into bins and trains base_estimator on the binned data. The predictions of base_estimator are then computed for all possible combinations of bins and stored together in a lookup table, so predicting the class of a new observation only requires finding its bins and reading the stored result.

Parameters:
  • base_estimator – classifier used to build predictions

  • n_bins (int | dict) –

    • int: how many bins to use for each axis

    • dict: feature_name -> int, specifies how many bins to use for each axis

    • dict: feature_name -> list of floats, manually sets the bin edges

    By default, the (weighted) quantiles are used to compute bin edges.

  • max_cells (int) – raise an error if the lookup table would have more cells than this.

  • keep_trained_estimator (bool) – if True, the trained estimator is kept after fitting.
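
To get a feel for max_cells, the size of the lookup table can be estimated as the product of the per-feature bin counts. The sketch below assumes exactly one table cell per combination of bins; the helper count_cells is hypothetical and not part of hep_ml.

```python
from math import prod

def count_cells(bins_per_feature):
    """Number of lookup-table cells, assuming one cell per combination of bins."""
    return prod(bins_per_feature)

MAX_CELLS = 500_000_000  # default max_cells from the class signature

for n_features in (6, 7, 8):
    cells = count_cells([16] * n_features)  # default n_bins=16 on every axis
    print(n_features, cells, cells <= MAX_CELLS)
```

Under this assumption, seven features with the default 16 bins fit into the limit (16^7 = 268,435,456 cells), while eight do not (16^8 = 4,294,967,296), so wider feature sets need fewer bins on some axes, for example through the dict form of n_bins.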

See also: this idea is used in the LHCb triggers; see V. Gligorov, M. Williams, ‘Bonsai BDT’.

The resulting formula is very simple and can be rewritten in another language or environment (C++, CUDA, etc.).
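
To illustrate why such a formula is easy to port, here is a minimal pure-Python sketch of the prediction step: a few comparisons per feature followed by one table read. It assumes the bin edges and the table are already available; BIN_EDGES, LOOKUP_TABLE and lookup_predict are hypothetical names with made-up values, and the actual hep_ml internals may differ.

```python
from bisect import bisect_right

# Hypothetical bin edges for two features: n edges define n + 1 bins.
BIN_EDGES = [
    [0.0, 1.0],  # feature 0: 3 bins
    [10.0],      # feature 1: 2 bins
]
# Hypothetical precomputed signal probabilities, one per cell (3 * 2 = 6),
# stored in row-major order (feature 1 varies fastest).
LOOKUP_TABLE = [0.05, 0.10, 0.30, 0.55, 0.70, 0.95]

def lookup_predict(event):
    """Return the stored probability for the cell that contains `event`."""
    index = 0
    for value, edges in zip(event, BIN_EDGES):
        bin_index = bisect_right(edges, value)  # bin along this axis
        index = index * (len(edges) + 1) + bin_index
    return LOOKUP_TABLE[index]

print(lookup_predict([0.5, 12.0]))
```

The same loop translates almost line by line to C++ or CUDA, since it uses only comparisons, integer arithmetic and an array read.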

check_dimensions(bin_edges)

convert_bins_to_lookup_index(bins_indices)
Parameters:

bins_indices – numpy.array of shape [n_samples, n_columns], filled with indices of bins.

Returns:

numpy.array of shape [n_samples] with the corresponding indices in the lookup table
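
One common way to flatten per-feature bin indices into a single table index is a row-major (mixed-radix) encoding. The sketch below shows that idea for a single row in plain Python; the function bins_to_lookup_index and the ordering convention are assumptions made for illustration, not necessarily what hep_ml does internally.

```python
def bins_to_lookup_index(bin_indices, bins_per_feature):
    """Flatten one row of per-feature bin indices into a single table index.

    Row-major (mixed-radix) encoding: the last feature varies fastest.
    """
    index = 0
    for bin_index, n_bins in zip(bin_indices, bins_per_feature):
        if not 0 <= bin_index < n_bins:
            raise ValueError("bin index out of range")
        index = index * n_bins + bin_index
    return index

bins_per_feature = [4, 3, 5]  # hypothetical bin counts for three features
rows = [[0, 0, 0], [0, 0, 1], [1, 2, 4], [3, 2, 4]]
print([bins_to_lookup_index(row, bins_per_feature) for row in rows])
# prints [0, 1, 29, 59]
```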

convert_lookup_index_to_bins(lookup_indices)
Parameters:

lookup_indices – array of shape [n_samples] with positions in the lookup table

Returns:

array of shape [n_samples, n_features] with indices of bins.
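
Recovering the per-feature bins from a flat table position can be sketched with repeated divmod, assuming a row-major layout in which the last feature varies fastest. The function lookup_index_to_bins is a hypothetical illustration for a single index, not the library's implementation.

```python
def lookup_index_to_bins(lookup_index, bins_per_feature):
    """Split a flat table index back into one bin index per feature."""
    bins = []
    # Peel off the fastest-varying (last) feature first.
    for n_bins in reversed(bins_per_feature):
        lookup_index, bin_index = divmod(lookup_index, n_bins)
        bins.append(bin_index)
    return bins[::-1]

bins_per_feature = [4, 3, 5]  # hypothetical bin counts for three features
print(lookup_index_to_bins(29, bins_per_feature))
# prints [1, 2, 4]
```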

fit(X, y, sample_weight=None)

Train a classifier and collect its predictions for all possible combinations of bins.

Parameters:
  • X – pandas.DataFrame or numpy.array with data of shape [n_samples, n_features]

  • y – array with labels of shape [n_samples]

  • sample_weight – None or array of shape [n_samples] with weights of events

Returns:

self
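
The "collect predictions for all possible combinations" step can be sketched as follows. The training itself is skipped: slow_model is a made-up stand-in for an expensive trained classifier that scores bin indices, and build_lookup_table is a hypothetical helper, so this only illustrates the enumeration, not hep_ml's actual code.

```python
from itertools import product

def slow_model(bin_indices):
    """Stand-in for an expensive trained classifier evaluated on bin indices."""
    first, second = bin_indices
    return (1 + first + 3 * second) / 10.0

def build_lookup_table(model, bins_per_feature):
    """Evaluate `model` once for every combination of bins."""
    combinations = product(*(range(n_bins) for n_bins in bins_per_feature))
    # product() varies the last feature fastest, i.e. row-major cell order.
    return [model(combination) for combination in combinations]

table = build_lookup_table(slow_model, bins_per_feature=[3, 2])
print(len(table))  # 3 * 2 = 6 cells
print(table)
```

After this step the expensive model is no longer needed for prediction, which is consistent with the keep_trained_estimator=False option shown in the example at the top of the page.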

predict(X)

Predict the class for each event.

Parameters:

X – pandas.DataFrame with data

Returns:

array of shape [n_samples] with predicted class labels.

predict_proba(X)

Predict class probabilities for new observations.

Parameters:

X – pandas.DataFrame with data

Returns:

probabilities, array of shape [n_samples, n_classes]

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → LookupClassifier

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in fit.

Returns

self : object

The updated object.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → LookupClassifier

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for sample_weight parameter in score.

Returns

self : object

The updated object.

transform(X)

Convert data to bin indices.

Parameters:

X – pandas.DataFrame or numpy.array with data

Returns:

numpy.array in which each value is replaced with the index of its bin
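
A minimal sketch of such a binning step, using only the standard library. It computes unweighted quantile edges (the weighting that the library applies by default is ignored here) and then maps each value to a bin with a binary search. The helpers quantile_edges and to_bin_indices are hypothetical, and which bin receives a value lying exactly on an edge is an assumption of this sketch.

```python
from bisect import bisect_right
from statistics import quantiles

def quantile_edges(values, n_bins):
    """Unweighted quantile cut points: n_bins - 1 edges define n_bins bins."""
    return quantiles(values, n=n_bins)

def to_bin_indices(rows, bin_edges):
    """Replace every value with the index of the bin it falls into."""
    return [
        [bisect_right(edges, value) for value, edges in zip(row, bin_edges)]
        for row in rows
    ]

# Toy training data with two features (made up for illustration).
train = [[float(i), float(i % 7)] for i in range(1, 101)]
columns = list(zip(*train))
bin_edges = [quantile_edges(column, n_bins=4) for column in columns]

binned = to_bin_indices(train, bin_edges)
print(bin_edges[0])
print(binned[:3])
```

With quantile edges, each bin of a continuous feature receives roughly the same number of training events, which avoids wasting table cells on sparsely populated regions.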