Personally I am a big fan of numpy package, since it makes the code clean and still quite fast.
However I am much worried about the speed, so decided to collect different benchmarks

numpy vs cython vs weave (numpy is about 2 times slower than others)
(posted in 2011)

Primarily the post is about numba, the pairwise distances are computed with cython, numpy, numba.
Numba is claimed to be the fastest, around 10 times faster than numpy.
(posted in 2013)

Julia is claimed by its developers to be very fast language.
Well, if that is true, there would be no need in writing easily-vectorizable operation in pure python (yep, I mean they are simply cheating), currently these 'benchmarks' are at the main page

The following post was written in 2011, the problem observed is solution of Laplace equation as usual.
Numpy is ~10 times slower than pure C++ solution (and equal to matlab), weave shows the speed comparable with pure C++ (I don't know anyone using weave now)

Fresh (2014) benchmark of different python tools, simple vectorized expression A*B-4.1*A > 2.5*B is evaluated with numpy, cython, numba, numexpr, and parakeet (and two latest are the fastest - about 10 times less time than numpy, achieved by using multithreading with two cores)

Haven't found any general-purpose theano vs numpy benchmarks, but in the article there is comparison of neural networks and theano is expected to give much better speed than numpy/torch(c++)/matlab, specially it is fast on GPU

One more detailed review of numpy vs cython vs c (held in 2014)
Let me copy-paste the example results (computation of std).

Method Time (ms) Compared to Python Compared to Numpy
Pure Python 183 x1 x0.03
Numpy 5.97 x31 x1
Naive Cython 7.76 x24 x0.8
Optimised Cython 2.18 x84 x2.7
Cython calling C 2.22 x82 x2.7

Simple, but comprehensive comparison of python accelerators was prepared by Jake Vanderplaas and Olivier Grisel:
Problem is computation of pairwise distances. The results this time seem to be unbiased:

Python 9.51s
Naive numpy 64.7 ms
Numba 6.72ms
Cython 6.57ms
Parakeet 12.3 ms
Cython 6.57 ms

By the way, I was recently looking for a slow place in my code, and it totally dropped of my mind that computation of residual is long operation:
Integer division / remainder computation time significaly depends on the bit depth (16 bit operation are ~2 times faster than 32 bit ones, division takes roughly 10 times more time than multiplication).

Surprising: multiplication/addition/binary operations take the same time (compared on numpy 1.9), which I personally find very strange (this behavior may be caused by quite complicated addressing in numpy arrays, which becomes the bottleneck, but this is only a guess).