Old-style optimization methods for neural networks

I have been diving deeper into methods for training neural networks.

A good, though incomplete, overview of what people have done in this area is given in this article from 2006.

Unfortunately, it does not include my favourite, Rprop, or its modifications (IRprop+, IRprop-).
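
For readers who have not met it, here is a minimal NumPy sketch of the IRprop- update rule. The function name `irprop_minus`, the `grad_fn` callback and the default hyperparameters are illustrative assumptions (1.2 and 0.5 are the commonly quoted growth and shrink factors); this is not taken from any particular library.

```python
import numpy as np

def irprop_minus(grad_fn, w, n_steps=100, delta0=0.1,
                 eta_plus=1.2, eta_minus=0.5,
                 delta_min=1e-6, delta_max=50.0):
    """Minimize a function with IRprop-, using only the sign of each partial derivative."""
    w = np.array(w, dtype=float)
    delta = np.full_like(w, delta0)   # one step size per weight
    prev_grad = np.zeros_like(w)
    for _ in range(n_steps):
        grad = np.asarray(grad_fn(w), dtype=float).copy()
        agreement = prev_grad * grad
        # Same sign as last time: grow the step. Sign flipped: shrink it.
        delta = np.where(agreement > 0, np.minimum(delta * eta_plus, delta_max), delta)
        delta = np.where(agreement < 0, np.maximum(delta * eta_minus, delta_min), delta)
        # After a sign flip, skip the move for that weight on this iteration.
        grad[agreement < 0] = 0.0
        w -= np.sign(grad) * delta
        prev_grad = grad
    return w
```

Note that the magnitude of the gradient never enters the update, only its sign, which is what makes the method insensitive to the scale of the loss.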

I also recently spent some time experimenting with neural networks and decided to improve IRprop by tracking progress not only along each axis, but also along the directions $w_i + w_j$ and $w_i - w_j$. I was sure this would speed up training.
It did speed things up at first, but the loss very quickly stops decreasing, and once it gets close to the minimum, serious oscillations start and the optimization becomes simply unstable. This method is implemented as the experimental IRprop* trainer in `hep_ml`.
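
The sign information along these extra directions can be read off the ordinary gradient, so no additional backward passes are needed. A sketch, assuming unit-normalized directions and writing $g_i = \partial L / \partial w_i$ with $e_i$ the $i$-th basis vector (this notation is introduced here for illustration only):

$$
u_{ij}^{\pm} = \frac{e_i \pm e_j}{\sqrt{2}}, \qquad
\frac{\partial L}{\partial u_{ij}^{\pm}} = \nabla L \cdot u_{ij}^{\pm} = \frac{g_i \pm g_j}{\sqrt{2}}
$$

An Rprop-style rule would then presumably keep a separate step size per direction and adapt it from the sign of $g_i \pm g_j$. How exactly the per-axis and per-pair steps are combined is not described in this post.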

Update: this old post, with its link from 2006, is quite obsolete and not appropriate for those interested in training deep networks. Instead, please have a look at this overview of methods for adaptive stochastic gradient optimization.
