Yesterday I got an interesting idea on how to implement automatic anomaly detection at CERN experiments. Today this work is done manually: many students and PhD students watch different distributions online. First, this is quite unreliable; second, it is quite expensive, since you need many people working all the time (nobody is paid for this, but you still spend money on travel).
So, the basic idea is quite simple: one can bin each variable and look at the number of events in each bin. Knowing that the number of events observed in each bin is Poisson-distributed, one can detect anomalies: a bin whose observed count is very improbable given its expected rate is flagged.
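The per-bin check can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the expected count per bin is taken as known (say, from reference runs), the function names are made up for this post, and the Bonferroni correction for looking at many bins at once is my addition, not something the idea itself dictates.

```python
import math

def poisson_pmf(k, mu):
    """P(X = k) for X ~ Poisson(mu), mu > 0, evaluated in log space."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def poisson_two_sided_p(k, mu):
    """Two-sided p-value: twice the smaller tail probability, capped at 1."""
    lower = sum(poisson_pmf(i, mu) for i in range(k + 1))  # P(X <= k)
    upper = max(0.0, 1.0 - lower + poisson_pmf(k, mu))     # P(X >= k)
    return min(1.0, 2.0 * min(lower, upper))

def flag_anomalous_bins(observed, expected, alpha=0.01):
    """Return indices of bins whose count is improbable under the expected rate.

    alpha is divided by the number of bins (Bonferroni), an assumption made
    here so that testing many bins does not inflate the false-alarm rate.
    """
    threshold = alpha / len(observed)
    return [
        i
        for i, (k, mu) in enumerate(zip(observed, expected))
        if poisson_two_sided_p(k, mu) < threshold
    ]

# Hypothetical example: four bins, each expected to hold 100 events.
print(flag_anomalous_bins([98, 105, 160, 93], [100.0] * 4))  # → [2]
```

The upper tail is computed as one minus the lower tail, which is fine for a monitoring sketch but loses precision for extremely small p-values.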
However, this detects only deviations of a single variable. How can one compute deviations of many variables at once?
Inside the LHCb experiment, for instance, we have the topological trigger, which uses gradient-boosted regression trees to filter out events. Each tree actually splits the data into bins (its leaves), so one can use this and estimate, for a single tree, the probability of observing an anomaly. Here we can apply Wilks' theorem, but only to each tree separately, since the bins of different trees are correlated.
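For a single tree, this could look roughly like the sketch below. It treats the leaves of one tree as bins, compares observed leaf counts with the fractions seen in a reference sample using a likelihood-ratio statistic, and uses Wilks' theorem to turn that statistic into a p-value (a chi-square distribution with number of leaves minus one degrees of freedom). Everything here is an assumption for illustration: the function names are hypothetical, the inputs are plain lists of leaf indices however you obtain them from the trigger, and the reference sample is assumed large enough that its leaf fractions can be treated as exact.

```python
import math
from collections import Counter

def chi2_sf(x, dof):
    """P(chi2_dof >= x) for integer dof >= 1, via the incomplete-gamma recurrence."""
    if x <= 0.0:
        return 1.0
    y = x / 2.0
    # Start from Q(1, y) for even dof, or Q(1/2, y) for odd dof.
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    # Q(a + 1, y) = Q(a, y) + y^a * exp(-y) / Gamma(a + 1)
    while a < dof / 2.0:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q

def tree_anomaly_p_value(reference_leaves, observed_leaves):
    """p-value that the observed leaf occupancy of ONE tree matches the reference."""
    ref = Counter(reference_leaves)
    obs = Counter(observed_leaves)
    n_ref, n_obs = len(reference_leaves), len(observed_leaves)
    g = 0.0
    for leaf, count in obs.items():
        if ref[leaf] == 0:
            return 0.0  # events in a leaf the reference never populated
        expected = n_obs * ref[leaf] / n_ref
        g += 2.0 * count * math.log(count / expected)
    return chi2_sf(g, dof=len(ref) - 1)

# Hypothetical leaf indices for one tree with three leaves.
reference = [0] * 500 + [1] * 300 + [2] * 200
shifted = [0] * 20 + [1] * 30 + [2] * 50
print(tree_anomaly_p_value(reference, shifted))
```

Each tree gets its own p-value this way; because the leaves of different trees are correlated, these p-values should not simply be multiplied or summed as if they were independent.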