Old-style optimization methods for neural networks
I dived deeper into methods for training neural networks.
A good, though incomplete, list of what people have done in this area is given in this article from 2006.
Unfortunately, it does not include my favourite, Rprop, or its modifications (IRprop+, IRprop-).
I also recently spent some time experimenting with neural networks, and I decided to improve IRprop by tracking movement not only along each axis, but also along the directions $w_i + w_j$ and $w_i - w_j$, expecting that this would speed up training.
It did speed up training at first, but the loss function soon stops decreasing, and once it is close to the minimal value, serious oscillations start and the optimization process becomes unstable.
This method is implemented as the experimental IRprop* trainer in `hep_ml`.
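For reference, here is a minimal sketch of the per-axis baseline, IRprop-, on a toy quadratic. It is not the IRprop* trainer from `hep_ml`; the function names, the toy loss, and the hyperparameter defaults (growth factor 1.2, shrink factor 0.5) are my assumptions for illustration. Each weight keeps its own step size, which grows while the gradient sign stays the same and shrinks when it flips.

```python
def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def irprop_minus(grad, w0, n_steps=200, step0=0.1,
                 eta_plus=1.2, eta_minus=0.5,
                 step_min=1e-6, step_max=50.0):
    """Illustrative IRprop- loop; `grad` maps a list of weights to a list of gradients."""
    w = list(w0)
    steps = [step0] * len(w)      # one step size per weight
    prev = [0.0] * len(w)         # gradient from the previous iteration
    for _ in range(n_steps):
        g = list(grad(w))
        for i in range(len(w)):
            if prev[i] * g[i] > 0:
                # same sign as before: accelerate along this axis
                steps[i] = min(steps[i] * eta_plus, step_max)
            elif prev[i] * g[i] < 0:
                # sign flipped: we overshot, so shrink the step and skip this update
                steps[i] = max(steps[i] * eta_minus, step_min)
                g[i] = 0.0
            # only the sign of the gradient is used, never its magnitude
            w[i] -= sign(g[i]) * steps[i]
            prev[i] = g[i]
    return w


def quadratic_grad(w):
    """Gradient of the toy loss f(w) = (w0 - 3)^2 + 10 * (w1 + 1)^2."""
    return [2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)]


w_final = irprop_minus(quadratic_grad, [0.0, 0.0])
```

On this separable loss each coordinate is optimized independently, which is the case where per-axis step sizes work best; the experiment described above adds the same bookkeeping for the pairwise directions.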
Update: this old post, with its link from 2006, is quite obsolete and not suitable for those interested in training deep networks. Instead, please have a look at this overview of methods for adaptive stochastic gradient optimization.