Link to the article: http://arxiv.org/abs/1505.00387

With the proposed technique one can build very deep neural networks (up to hundreds of layers). The key principle is very simple: the activation of the next layer is computed from the explicitly given activation of the previous layer, $$ x_{n+1} = x_n + f(x_n), $$ where the second summand contains the non-linearity (and this term is small enough, at least in the first iterations).
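To make the update rule concrete, here is a minimal sketch in plain NumPy (the layer sizes, the tanh non-linearity, and the small-weight initialization are my own illustrative assumptions, not taken from the paper): each layer adds a small non-linear correction $f(x_n)$ on top of the identity.

```python
import numpy as np

def make_block(dim, rng, scale=0.01):
    # Small random weights so that f(x) starts out small relative to x.
    W1 = rng.normal(0, scale, (dim, dim))
    W2 = rng.normal(0, scale, (dim, dim))
    def f(x):
        return np.tanh(x @ W1) @ W2   # the non-linear "correction" term
    return f

def forward(x, blocks):
    for f in blocks:
        x = x + f(x)                  # residual update: identity plus correction
    return x

rng = np.random.default_rng(0)
dim, depth = 16, 100                  # hundreds of layers are feasible
blocks = [make_block(dim, rng) for _ in range(depth)]
x = rng.normal(size=(1, dim))
print(forward(x, blocks).shape)       # (1, 16)
```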

There are actually two points here:

  • First, one uses very many layers and is thus able to approximate all the needed functions.
  • Second, since the first summand dominates, there is no vanishing gradient problem (see the derivation below).
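To spell out the second point: the Jacobian of the update is $$ \frac{\partial x_{n+1}}{\partial x_n} = I + \frac{\partial f(x_n)}{\partial x_n}, $$ so gradients always propagate through the identity term, even when $\partial f / \partial x_n$ is small, which is why they do not vanish with depth.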

Not sure if this really has any advantages over shallow ANNs, but it is still an interesting approach.

So, it's a way to train a deep network, though it doesn't have much to do with what people usually call 'deep learning', since here we are not trying to establish some new hidden categories.

Update: this trick became popular due to Microsoft's ResNet architecture and its success. And passing one of the previous activations separately is nowadays called a residual connection. Apparently, today this is what people call deep learning :/