This year my team at Yandex organized the MLHEP (Machine Learning in High Energy Physics) summer school in Lund, Sweden.

There were two tracks, basic and advanced, each lasting three days, plus two days on neural networks shared by both tracks.

The school was accompanied by two Kaggle challenges: one for both tracks and one for the advanced track only. This is the most productive way to try out techniques and learn them in practice.

Just as a year ago, I gave the lectures for the basic track. The previous materials were enriched with new topics and more explanations.

I’ve also added many visualizations and animations compared to the previous year.

This three-day course is about as short as a machine learning course gets, and it still gives a nice introduction to some advanced topics!

Day 1

Slides: MLHEP Lectures - day 1, basic track (from arogozhnikov)

Introduction to machine learning terminology. Applications within High Energy Physics and outside of it.

  • Basic problems: classification and regression
  • Nearest neighbours approach and spatial indices
  • Overfitting (intro)
  • Curse of dimensionality
  • ROC curve, ROC AUC
  • Bayes optimal classifier
  • Density estimation: KDE and histograms (see the sketch after this list)
  • Parametric density estimation
    • Mixtures for density estimation and EM algorithm
  • Generative approach vs discriminative approach
  • Linear models:
    • Linear decision rule, intro to logistic regression
    • Linear regression
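
Several Day 1 topics fit into one tiny example. Below is a minimal sketch (not from the school materials; scikit-learn and a synthetic dataset are my own choices) of a generative classifier: fit a KDE per class, score with the Bayes-rule posterior, and evaluate with ROC AUC.

```python
# A minimal sketch (not the school's notebooks): a generative classifier
# built from per-class kernel density estimates, scored via Bayes' rule
# and evaluated with ROC AUC. Dataset and bandwidth are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KernelDensity

X, y = make_classification(n_samples=2000, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=42)

# Nonparametric density estimation: one KDE per class, p(x | class).
kde = {label: KernelDensity(bandwidth=0.5).fit(Xtr[ytr == label])
       for label in (0, 1)}
prior = {label: np.mean(ytr == label) for label in (0, 1)}

# Bayes' rule: the posterior ratio p(1 | x) / p(0 | x) is monotone in
# log p(x | 1) + log p(1) - log p(x | 0) - log p(0), which is all that
# a ranking metric like ROC AUC needs.
log_ratio = (kde[1].score_samples(Xte) + np.log(prior[1])
             - kde[0].score_samples(Xte) - np.log(prior[0]))
print('ROC AUC:', roc_auc_score(yte, log_ratio))
```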

Day 2

Slides: MLHEP Lectures - day 2, basic track (from arogozhnikov)

  • Linear models: logistic regression
  • Polynomial decision rule and polynomial regression
  • SVM (Support Vector Machine) and kernel trick
  • Overfitting: two definitions
  • Model selection
  • Regularizations: L1, L2, elastic net
  • Decision trees
    • Splitting criteria for classification and regression
    • Overfitting in trees: pre-stopping and post-pruning
    • Instability of trees
    • Feature importance
  • Ensembling
    • RSM (random subspace method), subsampling, bagging
    • Random Forest (see the sketch below)
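
To tie the tree and ensembling bullets together, here is a minimal sketch (scikit-learn on synthetic data, my own choice, not the school's notebooks) of a Random Forest: bagging plus random feature subsets per split, with an out-of-bag quality estimate and feature importances.

```python
# A minimal sketch (scikit-learn, synthetic data; illustrative only):
# bagging + random feature subsets at each split = Random Forest, with
# an out-of-bag quality estimate and per-feature importances for free.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=6, n_informative=3,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=200, max_depth=6,
                                oob_score=True, random_state=0)
forest.fit(X, y)

# Averaging many decorrelated trees tames the instability of a single
# decision tree; out-of-bag samples give a built-in validation set.
print('OOB accuracy:', forest.oob_score_)
print('feature importances:', forest.feature_importances_)
```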

Day 3

Slides: MLHEP Lectures - day 3, basic track (from arogozhnikov)

  • Ensembles
    • AdaBoost
    • Gradient Boosting for regression
    • Gradient Boosting for classification (see the sketch after this list)
    • Second-order information
    • Losses: regression, classification, ranking
  • Multiclass classification:
    • ensembling
    • softmax modifications
  • Feature engineering and output engineering
  • Feature selection
  • Dimensionality reduction:
    • PCA
    • LDA, CSP
    • LLE
    • Isomap
  • Hyperparameter optimization
    • ML-based approach
    • Gaussian processes
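
As a companion to the boosting bullets, a minimal sketch (scikit-learn, synthetic data; illustrative only) of gradient boosting for classification, using staged predictions to watch test quality as trees are added.

```python
# A minimal sketch (scikit-learn, synthetic data; illustrative only):
# gradient boosting for classification, with staged predictions to watch
# test quality grow - the practical way to pick the number of trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=10, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=1)

gb = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05,
                                max_depth=3, random_state=1)
gb.fit(Xtr, ytr)

# staged_decision_function yields the ensemble's raw score after each
# added tree, so overfitting shows up as ROC AUC flattening or dropping.
for n_trees, score in enumerate(gb.staged_decision_function(Xte), start=1):
    if n_trees % 100 == 0:
        print(n_trees, 'trees -> test ROC AUC:',
              roc_auc_score(yte, score.ravel()))
```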

Day 4, part 1

Slides: Reweighting and Boosting to uniformity in HEP (from arogozhnikov)

Slides by Tatiana Likhomanenko on non-trivial applications of boosting in High Energy Physics.
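
Gradient-boosted reweighting of this kind is implemented in the hep_ml Python package; below is a minimal sketch assuming its GBReweighter interface, with toy Gaussian samples standing in for simulation and real data.

```python
# A minimal sketch assuming the hep_ml package's GBReweighter interface
# (gradient-boosted reweighting); toy Gaussians stand in for simulation
# and real data here.
import numpy as np
from hep_ml.reweight import GBReweighter

rng = np.random.RandomState(0)
mc_sample = rng.normal(loc=0.0, scale=1.2, size=(10000, 2))  # simulation
real_data = rng.normal(loc=0.3, scale=1.0, size=(10000, 2))  # observed data

# Each boosting iteration builds a tree over all features at once, so
# correlations between variables are reweighted, not just 1D projections.
reweighter = GBReweighter(n_estimators=50, learning_rate=0.1, max_depth=3)
reweighter.fit(mc_sample, real_data)

# Per-event weights that make the simulated sample resemble the target.
weights = reweighter.predict_weights(mc_sample)
print(weights[:5])
```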

Links

  1. All materials from the school are available in the MLHEP 2016 repository
  2. Official page at Indico
  3. Kaggle competitions for the school: exotic Higgs and triggers