At the end of this August our team from Yandex organized MLHEP 2015, a summer school on Machine Learning in High Energy Physics.

The school lasted only 4 days, but even in this short time we managed to teach many things.

The school consisted of two tracks, introductory and advanced; every day each track had 2 lectures and 2 practical seminars. In addition, each evening there was a special physics talk by invited speakers from CERN.

And that is not everything: we organized an in-class Kaggle competition based on the COMET tracking problem I wrote about (part 1, part 2), so participants could try ML methods on a real-world problem.

I gave the lectures on the introductory track. This was really challenging: fitting a course of ML into 4 days for people who have no experience in ML and come from different backgrounds (most of the introductory track listeners were particle physicists, but this is not very helpful).

One more caveat: since the schedule was completely filled, we decided to assign no tasks (and thus all the theoretical knowledge had to be obtained from the slides).

For this reason I decided to minimize the number of things introduced in the course. The only non-trivial notion I used was the decision function. No $F(x)$, no $h_i(x)$, no $Q(x, y)$, no margins, no $\Theta$, no $C(Y, F)$ and other stuff.
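To give a feeling of what a decision function looks like in practice, here is a minimal sketch for kNN. It is not taken from the slides: the names `knn_decision_function`, `classify`, the toy points and the threshold value are all made up for illustration, and I assume the usual meaning of the term, that is, a function that maps the features of an event to a single number, which is then compared with a threshold.

```python
import math


def knn_decision_function(train_points, train_labels, k=3):
    """Build a decision function from labeled training points.

    The returned function maps a point to the fraction of signal
    (label 1) among its k nearest training points.
    Hypothetical helper, written only for this illustration.
    """
    def decision(x):
        # order training indices by Euclidean distance to x
        order = sorted(range(len(train_points)),
                       key=lambda i: math.dist(train_points[i], x))
        nearest = order[:k]
        return sum(train_labels[i] for i in nearest) / k

    return decision


# toy data (invented): label 1 = "signal", label 0 = "background"
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = [0, 0, 0, 1, 1, 1]

decision = knn_decision_function(points, labels, k=3)

# the threshold is a free choice: moving it trades signal efficiency
# against background rejection
threshold = 0.5


def classify(x):
    # predicted class: 1 if the decision function exceeds the threshold
    return int(decision(x) > threshold)
```

The point of this view is that the classifier itself only produces a number per event; which events are called signal is decided separately, by the choice of threshold.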

Despite these limitations, the course contained the whole ‘starter kit’ and even more:

  • kNN
  • optimal Bayesian classifier, QDA
  • logistic regression
  • neural networks
  • decision trees, building, splitting criteria
  • estimating feature importance
  • overfitting
  • ensembles, bagging
  • Random Forest
  • comparison of multidimensional distributions
  • AdaBoost
  • Gradient Boosting, modifications for regression, classification, ranking
  • Boosting to uniformity (uBoost and FlatnessLoss)
  • Fast predictions for online trigger systems (Bonsai BDT)
  • reweighting, Gradient Boosted reweighting
  • hyperparameter optimization, Gaussian Processes
  • using classifiers’ output to test physical hypotheses
  • unsupervised ML: PCA, autoencoders

Also, I significantly reduced the number of formulas and added various demonstrations of how the different algorithms work.

This is really a lot for an introductory 4-day course, but I consider it OK to give more during the course. The problem is that I forgot to include some important notes with conclusions; next time I’ll add them explicitly to the slides :)

Slides

MLHEP 2015: Introductory Lecture #1 from arogozhnikov
MLHEP 2015: Introductory Lecture #2 from arogozhnikov
MLHEP 2015: Introductory Lecture #3 from arogozhnikov
MLHEP 2015: Introductory Lecture #4 from arogozhnikov
  1. All materials from school
  2. Official school site
  3. Kaggle competition for school