Personally I am a big fan of numpy package, since it makes the code clean and still quite fast.
However I am much worried about the speed, so decided to collect different benchmarks

numpy vs cython vs weave (numpy is about 2 times slower than others)
(posted in 2011)

Primarily the post is about numba, the pairwise distances are computed with cython, numpy, numba.
Numba is claimed to be the fastest, around 10 times faster than numpy.
(posted in 2013)

Julia is claimed by its developers to be very fast language.
Well, if that is true, there would be no need in writing easily-vectorizable operation in pure python (yep, I mean they are simply cheating), currently these 'benchmarks' are at the main page

The following post was written in 2011, the problem observed is solution of Laplace equation as usual.
Numpy is ~10 times slower than pure C++ solution (and equal to matlab), weave shows the speed comparable with pure C++ (I don't know anyone using weave now)

Fresh (2014) benchmark of different python tools, simple vectorized expression A*B-4.1*A > 2.5*B is evaluated with numpy, cython, numba, numexpr, and parakeet (and two latest are the fastest - about 10 times less time than numpy, achieved by using multithreading with two cores)

Haven't found any general-purpose theano vs numpy benchmarks, but in the article there is comparison of neural networks and theano is expected to give much better speed than numpy/torch(c++)/matlab, specially it is fast on GPU

One more detailed review of numpy vs cython vs c (held in 2014)
Let me copy-paste the example results (computation of std).

MethodTime (ms)Compared to PythonCompared to Numpy
Pure Python183x1x0.03
Naive Cython7.76x24x0.8
Optimised Cython2.18x84x2.7
Cython calling C2.22x82x2.7

Simple, but comprehensive comparison of python accelerators was prepared by Jake Vanderplaas and Olivier Grisel:
Problem is computation of pairwise distances. The results this time seem to be unbiased:

Naive numpy64.7 ms
Parakeet12.3 ms
Cython6.57 ms

By the way, I was recently looking for a slow place in my code, and it totally dropped of my mind that computation of residual is long operation:
Integer division / remainder computation time significaly depends on the bit depth (16 bit operation are ~2 times faster than 32 bit ones, division takes roughly 10 times more time than multiplication).

Surprising: multiplication/addition/binary operations take the same time (compared on numpy 1.9)