Pairwise layer in Neural Networks
A short note on one of my experiments with neural networks.
Neural networks were originally developed as simulations of real neurons, and the first training rules (e.g. Hebb's rule) 'reproduced' the behaviour we observe in nature. Or at least they reproduced our simplistic understanding of that process.
But I don't expect this approach to be very fruitful today. I prefer to think of a neural network as just one of the ways to define a function.
    For instance, the activation function of a one-layer perceptron may be written as
    $$f(x) = \sigma( a^i \, x_i )$$
    Following the Einstein convention, I omit the summation sign over $i$; $a^i$ are the weights.
    The activation function of a two-layer perceptron ($a^i_j$ and $b^j$ are weights):
    $$ f(x) = \sigma( b^j \, \sigma( a^i_j \, x_i )) $$
    If one works with vector variables, where $Ax$ is the matrix-by-vector product, $b$ acts on a vector as a dot product, and $\sigma x$ denotes
    the element-wise sigmoid function, then the activation function can be written in a simple way:
    $$ f(x) = \sigma b \sigma A x $$
This is how one can define a two-layer perceptron in Theano, for instance. A three- or four-layer perceptron isn't really more complicated.
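To make the vector form concrete, here is a minimal sketch of the same two-layer function. It uses NumPy rather than Theano, and the names `sigmoid`, `two_layer_perceptron` and the layer sizes are my own choices for illustration, not anything fixed by the formulas above.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def two_layer_perceptron(x, A, b):
    """f(x) = sigma(b . sigma(A x)).

    x: input vector, shape (n_in,)
    A: first-layer weights a^i_j, shape (n_hidden, n_in)
    b: second-layer weights b^j, shape (n_hidden,)
    """
    hidden = sigmoid(A @ x)      # sum over i of a^i_j x_i, then sigmoid
    return sigmoid(b @ hidden)   # sum over j of b^j hidden_j, then sigmoid

# Random weights and input, only to show the shapes involved (assumed sizes).
rng = np.random.default_rng(0)
x = rng.normal(size=3)
A = rng.normal(size=(4, 3))
b = rng.normal(size=4)
output = two_layer_perceptron(x, A, b)
```

Adding a third layer would just mean one more matrix and one more `sigmoid` call in the chain.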
But defining the function is only part of the story: what about training the network? I'm sure that the most efficient algorithms will come not from neurobiology but from pure mathematics. And this is how it is done in today's guides to neural networks: you define the activation function, define some figure of merit (logloss, for instance), and then use your favourite optimization method.
I hope that soon activation functions will be inspired by mathematics as well, though I haven't had much success in this direction.
One of the activation functions I tried is the following:
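The recipe "define the function, define a figure of merit, optimize" can be sketched on the simplest case, the one-layer perceptron with logloss and plain gradient descent. The toy data, the learning rate of 0.5 and the 500 steps are assumptions made up for this illustration; any other optimizer could be swapped in.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(a, X):
    """One-layer perceptron f(x) = sigma(a . x), applied to each row of X."""
    return sigmoid(X @ a)

def logloss(a, X, y):
    """Mean negative log-likelihood of binary labels y in {0, 1}."""
    p = np.clip(predict(a, X), 1e-12, 1 - 1e-12)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def logloss_gradient(a, X, y):
    """Analytic gradient of logloss with respect to the weights a."""
    return X.T @ (predict(a, X) - y) / len(y)

# Toy data (made up for illustration): labels come from a known weight vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)

a = np.zeros(3)
initial_loss = logloss(a, X, y)
for _ in range(500):
    a -= 0.5 * logloss_gradient(a, X, y)  # plain gradient descent step
final_loss = logloss(a, X, y)
```

Nothing in the loop knows about neurons: it only needs a differentiable function and a loss, which is the point of treating the network as just a function.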
First layer: $$y = \sigma A x $$
Second (pairwise) layer: $$f(x) = \sigma (b^{ij} y_i y_j ) $$
The difference here is that we can now use not only the activations of individual neurons but also pairwise interactions between them. Unfortunately, I didn't notice much difference between this modification and a simple two-layer network.
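For reference, a minimal NumPy sketch of the pairwise network defined above (not the original Theano code; the function names and layer sizes are assumptions for illustration). `np.einsum` lets the summation over $i$ and $j$ be written almost exactly as in the formula.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pairwise_network(x, A, B):
    """f(x) = sigma(b^{ij} y_i y_j) with y = sigma(A x).

    x: input vector, shape (n_in,)
    A: first-layer weights, shape (n_hidden, n_in)
    B: pairwise weights b^{ij}, shape (n_hidden, n_hidden)
    """
    y = sigmoid(A @ x)
    # einsum spells out the Einstein summation over i and j explicitly.
    return sigmoid(np.einsum("ij,i,j->", B, y, y))

# Random weights and input, only to show the shapes involved (assumed sizes).
rng = np.random.default_rng(0)
x = rng.normal(size=3)
A = rng.normal(size=(4, 3))
B = rng.normal(size=(4, 4))
output = pairwise_network(x, A, B)
```

Since $y_i y_j$ is symmetric in $i$ and $j$, only the symmetric part of `B` affects the output, so the pairwise layer effectively has $n(n+1)/2$ free weights for $n$ hidden units rather than $n^2$.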
Thanks to Theano, it is very simple to play with different activation functions :)
Well, I was wrong: after checking on the Higgs boson dataset from Kaggle, I found out that this kind of neural network works much better than traditional ones! Hurrah!
It is still much worse than GBDT, but after building AdaBoost over the neural network I was able to get comparable (or just the same) quality. The only problem is that GBDT trained in minutes, while boosting over the NN took ~24 hours to train.