Fast predictions
hep_ml.speedup is a module for obtaining machine-learning formulas that can be applied very fast (with a speed comparable to simple selections) while keeping high classification quality.
In many applications (e.g. triggers in HEP) a really fast formula is essential. This module contains tools to prepare formulas that can be applied at a speed comparable to cuts.
Example
Let’s show how one can use a really heavy classifier and still get fast predictions:
>>> from sklearn.ensemble import RandomForestClassifier
>>> from hep_ml.speedup import LookupClassifier
>>> base_classifier = RandomForestClassifier(n_estimators=1000, max_depth=25)
>>> classifier = LookupClassifier(base_estimator=base_classifier, keep_trained_estimator=False)
>>> classifier.fit(X, y, sample_weight=sample_weight)
Although training takes a long time, all predictions are precomputed and saved to a lookup table, so you can predict millions of events per second using a single CPU:
>>> classifier.predict_proba(testX)
- class hep_ml.speedup.LookupClassifier(base_estimator, n_bins=16, max_cells=500000000, keep_trained_estimator=True)
Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin
LookupClassifier splits each feature into bins and trains base_estimator on the binned data. The results of base_estimator are then computed for all possible combinations of bins and saved together in a lookup table, so predicting the class of a new observation only requires finding its bins and reading the stored result.
- Parameters
base_estimator – classifier used to build predictions
n_bins (int | dict) –
int: how many bins to use for each axis
dict: feature_name -> int, specify how many bins to use for each axis
dict: feature_name -> list of floats, set the bin edges manually
By default, the (weighted) quantiles are used to compute bin edges.
max_cells (int) – raise an error if the lookup table would have more cells than this.
keep_trained_estimator (bool) – if True, the trained base estimator is kept after fitting.
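The n_bins and max_cells parameters can be illustrated with a small NumPy sketch. It is not the hep_ml implementation: `weighted_quantile_edges` and `lookup_table_size` are hypothetical helpers, the midpoint convention for the weighted CDF is just one common choice, and the cell count assumes one cell per combination of bins, as described above.

```python
import math
import numpy as np

def weighted_quantile_edges(values, weights, n_bins):
    # Hypothetical helper: interior bin edges chosen so that each bin
    # holds roughly the same total weight.
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    # Weighted CDF evaluated at the midpoint of each sample's weight.
    cdf = (np.cumsum(weights) - 0.5 * weights) / np.sum(weights)
    targets = np.linspace(0, 1, n_bins + 1)[1:-1]
    return np.interp(targets, cdf, values)

def lookup_table_size(n_bins_per_feature):
    # One cell per combination of bins: the product of the bin counts.
    return math.prod(n_bins_per_feature)

rng = np.random.default_rng(0)
values = rng.exponential(size=10_000)
weights = rng.uniform(0.5, 1.5, size=10_000)
edges = weighted_quantile_edges(values, weights, n_bins=4)

# Fraction of the total weight that ends up in each of the 4 bins.
bin_of_each_value = np.searchsorted(edges, values)
weight_share = np.bincount(bin_of_each_value, weights=weights) / weights.sum()

# With the default n_bins=16, check how many features fit under max_cells.
max_cells = 500_000_000
fits = {n: lookup_table_size([16] * n) <= max_cells for n in (6, 7, 8)}
```

With 16 bins per feature, 7 features give 16^7 = 268,435,456 cells, which is under the default max_cells, while 8 features give 16^8 = 4,294,967,296 cells, which is over it. Under these assumptions, using more features means reducing the number of bins on some axes.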
See also: this idea is used inside LHCb triggers, see V. Gligorov, M. Williams, ‘Bonsai BDT’
The resulting formula is very simple and can be rewritten in another language or environment (C++, CUDA, etc.).
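To make the idea concrete, here is a minimal NumPy sketch of a binned lookup table. It is an illustration only, not the hep_ml code: `stand_in_model` is a hypothetical replacement for a trained base_estimator, `quantile_edges` and `to_bin_indices` are made-up helper names, and the edges use unweighted quantiles for brevity.

```python
import numpy as np

def quantile_edges(column, n_bins):
    # Interior bin edges at evenly spaced (unweighted) quantiles.
    return np.quantile(column, np.linspace(0, 1, n_bins + 1)[1:-1])

def to_bin_indices(X, edges_per_feature):
    # Map every value to the index of the bin it falls into.
    return np.column_stack([
        np.searchsorted(edges, X[:, i])
        for i, edges in enumerate(edges_per_feature)
    ])

def stand_in_model(bin_indices):
    # Hypothetical replacement for a trained base_estimator:
    # any function of the bin indices would do here.
    return 1.0 / (1.0 + np.exp(bin_indices[:, 1] - bin_indices[:, 0]))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
n_bins = 4
shape = (n_bins, n_bins)

edges = [quantile_edges(X[:, i], n_bins) for i in range(X.shape[1])]

# Slow step, done once: evaluate the model on every combination of bins.
all_combinations = np.indices(shape).reshape(len(shape), -1).T
lookup_table = stand_in_model(all_combinations).reshape(shape)

# Fast step: binning plus one array lookup per event.
bins = to_bin_indices(X, edges)
fast_predictions = lookup_table[tuple(bins.T)]
```

The fast step consists only of a sorted search per feature and one table read, which is why such a formula is easy to port to other environments.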
- convert_bins_to_lookup_index(bins_indices)
- Parameters
bins_indices – numpy.array of shape [n_samples, n_columns], filled with indices of bins.
- Returns
numpy.array of shape [n_samples] with corresponding index in lookup table
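A conversion of this kind can be sketched with numpy.ravel_multi_index, which flattens one index per axis into a single position. The row-major ordering and the bin counts below are assumptions made for illustration; the ordering actually used by LookupClassifier is not stated here.

```python
import numpy as np

# Assumed number of bins along each of three features.
n_bins_per_feature = (3, 4, 2)

# One row per sample, one bin index per feature.
bins_indices = np.array([
    [0, 0, 0],
    [0, 0, 1],
    [1, 2, 0],
    [2, 3, 1],
])

# Row-major flattening: index = (i * 4 + j) * 2 + k for these bin counts.
lookup_indices = np.ravel_multi_index(tuple(bins_indices.T), n_bins_per_feature)
# lookup_indices is array([ 0,  1, 12, 23])
```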
- convert_lookup_index_to_bins(lookup_indices)
- Parameters
lookup_indices – array of shape [n_samples] with positions in the lookup table
- Returns
array of shape [n_samples, n_features] with indices of bins.
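The inverse mapping can be sketched with numpy.unravel_index. As before, the row-major ordering and the bin counts are assumptions for illustration, not necessarily what LookupClassifier does internally.

```python
import numpy as np

# Assumed number of bins along each of three features.
n_bins_per_feature = (3, 4, 2)

# Positions in a flattened (row-major) lookup table.
lookup_indices = np.array([0, 1, 12, 23])

# unravel_index returns one array per feature; stack them as columns
# to get an array of shape [n_samples, n_features].
bins_indices = np.column_stack(
    np.unravel_index(lookup_indices, n_bins_per_feature)
)

# Converting back recovers the original positions.
round_trip = np.ravel_multi_index(tuple(bins_indices.T), n_bins_per_feature)
```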
- fit(X, y, sample_weight=None)
Train a classifier and collect predictions for all possible combinations.
- Parameters
X – pandas.DataFrame or numpy.array with data of shape [n_samples, n_features]
y – array with labels of shape [n_samples]
sample_weight – None or array of shape [n_samples] with weights of events
- Returns
self
- predict(X)
Predict class for each event
- Parameters
X – pandas.DataFrame with data
- Returns
array of shape [n_samples] with predicted class labels.