Dataset to classify:
train loss: test loss:
Decision functions of first trees


What is this?

This is an interactive demonstration-explanation of gradient boosting algorithm applied to classification problem. Boosting takes a decision ('blue' or 'orange') by iteratively building many simpler classification algorithms (decision trees in our case).


Can you ..

  1. overfit the model? (from some point test loss should increase)
  2. achieve test loss < 0.02 on 'xor' dataset (second one) using depth = 2?
  3. same for depth = 1?
  4. achieve test loss < 0.1 on spiral dataset using minimal depth of trees?
  5. fit 'stripes' dataset with trees of depth 1? Can you explain why this is possible?
  6. get minimal possible loss on dataset with several embedded circles?


  • each time you change some parameter, gradient boosting is recomputed from the scratch
    (yes, gradient boosting algorithm is quite fast)
  • if learning rate is small, target doesn't change much. As a result, consequent trees have similar structure
  • to fight this, different randomizations are introduced:
    • subsampling (take random part of data to train each tree)
    • and random subspaces (take random subset of features to build a tree),
      only 2 variables in the demo, so random rotations are used instead of random subspaces.
  • from other demo you can find why large learning rate is a bad idea and small learning rate is recommended.
    • did you notice? If you set learning rate to be high (without using Newton-Raphson update) only several first trees make serious contribution, other trees are almost not used
  • datasets from other playgrounds are too easy for gradient boosting,
    that's why some challenging datasets were added
  • updating tree leaves using Newton-Raphson method is something typically ignored in the ML courses for simplicity, but in practice this is a cheap way to build efficient small ensembles.

There are many other things about GB you can find out from this demo. Enjoy!

Tracking the building of an ensemble

Gradient boosting is quite different from other optimization algorithms. One interesting detail is that you can effortlessly return to any point of training process after the ensemble is already built (i.e. we can't do this for neural networks).

Let's use this feature to understand boosting better.

  • after each tree is built, gradients are recomputed (hover cursor over any 'plus'!)
  • gradients for points in class 'orange' are always positive (and always negative for 'blue' points)
  • moduli of gradients are denoted by size of a corresponding point: the smaller the modulus, the better we classified data sample.
  • next tree is built using computed gradients. Tree is paying more attention to events with larger module of gradient (those, which were poorly classified on the previous stages).
  • sounds like AdaBoost, right?
  • (note: gradients are scaled when displayed. Typically, gradients are diminishing over time)