Fast predictions

hep_ml.speedup is a module for obtaining machine-learning formulas that can be applied very fast (at a speed comparable to simple selections) while keeping high classification quality.

In many applications (e.g. triggers in HEP) a really fast formula is essential. This module contains tools to prepare formulas that can be applied at a speed comparable to cuts.

Example

Let’s show how one can use a really heavy classifier and still get fast predictions:

>>> from sklearn.ensemble import RandomForestClassifier
>>> from hep_ml.speedup import LookupClassifier
>>> base_classifier = RandomForestClassifier(n_estimators=1000, max_depth=25)
>>> classifier = LookupClassifier(base_estimator=base_classifier, keep_trained_estimator=False)
>>> classifier.fit(X, y, sample_weight=sample_weight)

Though training takes a long time, all predictions are precomputed and saved to a lookup table, so you can predict millions of events per second on a single CPU:

>>> classifier.predict_proba(testX)

class hep_ml.speedup.LookupClassifier(base_estimator, n_bins=16, max_cells=500000000, keep_trained_estimator=True)

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

LookupClassifier splits each feature into bins and trains base_estimator on the binned data. The predictions of base_estimator are then computed for all possible combinations of bins and saved together in a lookup table, so predicting the class of a new observation reduces to finding its bins and reading the stored result.

Parameters
  • base_estimator – classifier used to build predictions

  • n_bins (int | dict) –

    • int: how many bins to use for each axis

    • dict: feature_name -> int, specifies how many bins to use for each axis

    • dict: feature_name -> list of floats, sets the bin edges manually

    By default, the (weighted) quantiles are used to compute bin edges.

  • max_cells (int) – raise an error if the lookup table would have more cells than this.

  • keep_trained_estimator (bool) – if True, the trained estimator is kept.
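To get a feel for how n_bins and max_cells interact, the table size can be estimated as the product of the bin counts over all features. The sketch below is a rough estimate and not the library's own check: the helper estimate_cells and the feature names are hypothetical, and it assumes that each feature contributes exactly as many cells as it has bins.

```python
import math

def estimate_cells(n_bins, feature_names):
    """Estimate the number of lookup-table cells (hypothetical helper).

    n_bins is either an int (same bin count for every feature) or a dict
    mapping feature name -> bin count. Assumes one cell per bin combination.
    """
    if isinstance(n_bins, int):
        counts = [n_bins] * len(feature_names)
    else:
        counts = [n_bins[name] for name in feature_names]
    return math.prod(counts)

features = ["pt", "eta", "ip", "chi2", "mass", "dira", "fd"]  # made-up names
max_cells = 500_000_000  # default value from the class signature

cells_7 = estimate_cells(16, features)            # 16**7
cells_8 = estimate_cells(16, features + ["iso"])  # 16**8
fits_7 = cells_7 <= max_cells
fits_8 = cells_8 <= max_cells
```

Under these assumptions, seven features with 16 bins each stay below the default max_cells, while an eighth feature exceeds it, so a larger feature set needs fewer bins per axis.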

See also: this idea is used inside LHCb triggers, see V. Gligorov, M. Williams, ‘Bonsai BDT’

The resulting formula is very simple and can be rewritten in another language or environment (C++, CUDA, etc.).
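The lookup idea described above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the hep_ml implementation: slow_model is a hypothetical stand-in for a trained base estimator, bin edges come from unweighted quantiles, and the cell ordering is NumPy's default row-major order.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))  # toy data: 1000 events, 2 features

# Bin edges from plain (unweighted) quantiles; 3 inner edges give 4 bins.
n_bins = 4
inner_quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]
edges = [np.quantile(X[:, j], inner_quantiles) for j in range(X.shape[1])]
dims = (n_bins,) * X.shape[1]

def to_bins(points):
    """Replace every feature value with the index of its bin."""
    columns = [np.searchsorted(edges[j], points[:, j]) for j in range(len(edges))]
    return np.column_stack(columns)

def slow_model(bin_indices):
    """Hypothetical stand-in for an expensive classifier on binned features."""
    score = bin_indices[:, 0] - 2.0 * bin_indices[:, 1]
    return 1.0 / (1.0 + np.exp(-score))

# Evaluate the slow model once for every combination of bins.
all_cells = np.array(list(itertools.product(*(range(d) for d in dims))))
lookup_table = slow_model(all_cells)  # shape [n_cells]

def fast_predict(points):
    """Prediction is a binning step plus one table read per event."""
    flat = np.ravel_multi_index(tuple(to_bins(points).T), dims)
    return lookup_table[flat]

predictions = fast_predict(X)
```

After the table is built, the slow model is no longer needed at prediction time, which is what makes it possible to port the result to other environments.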

check_dimensions(bin_edges)

convert_bins_to_lookup_index(bins_indices)
Parameters

bins_indices – numpy.array of shape [n_samples, n_columns], filled with indices of bins.

Returns

numpy.array of shape [n_samples] with the corresponding indices in the lookup table

convert_lookup_index_to_bins(lookup_indices)
Parameters

lookup_indices – array of shape [n_samples] with positions in the lookup table

Returns

array of shape [n_samples, n_features] with indices of bins.
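The two conversions are inverses of each other: a row of per-feature bin indices maps to a single flat position in the table and back. A minimal sketch of such a mapping with NumPy follows. The bin counts in dims are hypothetical, and the row-major ordering used here is an assumption; the ordering used inside the library may differ.

```python
import numpy as np

# Hypothetical setup: three features with 4, 3 and 5 bins respectively.
dims = (4, 3, 5)

# Bin indices for two events, shape [n_samples, n_features].
bins_indices = np.array([[0, 0, 0],
                         [3, 2, 4]])

# Bin indices -> flat position (NumPy expects one array per axis, hence .T).
lookup_indices = np.ravel_multi_index(tuple(bins_indices.T), dims)

# Flat position -> bin indices, restoring shape [n_samples, n_features].
recovered = np.stack(np.unravel_index(lookup_indices, dims), axis=1)
```

With these dimensions the table has 4 * 3 * 5 = 60 cells, so the first and last combinations land at positions 0 and 59.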

fit(X, y, sample_weight=None)

Train a classifier and collect predictions for all possible combinations.

Parameters
  • X – pandas.DataFrame or numpy.array with data of shape [n_samples, n_features]

  • y – array with labels of shape [n_samples]

  • sample_weight – None or array of shape [n_samples] with weights of events

Returns

self

predict(X)

Predict the class of each event.

Parameters

X – pandas.DataFrame with data

Returns

array of shape [n_samples] with predicted class labels.

predict_proba(X)

Predict probabilities for new observations.

Parameters

X – pandas.DataFrame with data

Returns

probabilities, array of shape [n_samples, n_classes]

transform(X)

Convert data to bin indices.

Parameters

X – pandas.DataFrame or numpy.array with data

Returns

numpy.array in which each column is replaced with the bin indices of its values
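Converting values to bin indices can be illustrated with numpy.searchsorted. This is a sketch under assumptions rather than the library's code: the feature names and edges are made up, and how the library treats a value that falls exactly on an edge is not stated here (this example puts it in the lower bin).

```python
import numpy as np

# Hypothetical bin edges: 3 inner edges define 4 bins per feature.
bin_edges = {
    "x": np.array([0.0, 1.0, 2.0]),
    "y": np.array([10.0, 20.0, 30.0]),
}

# Two features, three events, stored as a dict of columns.
data = {
    "x": np.array([-0.5, 1.5, 7.0]),
    "y": np.array([10.0, 25.0, 5.0]),
}

# With the default side="left", searchsorted returns for each value the
# number of edges strictly below it, which serves as the bin index.
bin_indices = np.column_stack(
    [np.searchsorted(bin_edges[name], data[name]) for name in bin_edges]
)
```

Values below the first edge fall into bin 0 and values above the last edge into the last bin, so every input gets a valid index.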