Recently worked on optimization of some (internal) classifier. The problem was mostly not in training, but in applying of trained classifier — this code was originally written in C++ and then translated to cython (which decreased the speed by factor of 2).

This was quite easy rewrite the code using numpy and vectorized approach (initially predictions were built event-by event, after rewriting the classifier was applied tree-by-tree). However this gave only speed comparable with original C++ code (and twice faster than cython version).

What really fastened the code is switching from int8 operations to int64 (the latest are natively supported in all modern processors). So 8 operations in int8 were grouped into one 64-bit operation.

This is pretty simple to perform by creating views:

In[1]: import numpy
In[2]: x = numpy.random.randint(0, 100, size=64000).astype(’int8′)
In[3]: y = numpy.random.randint(0, 100, size=64000).astype(’int8′)
In[4]: %timeit x & y
10000 loops, best of 3: 60.6 µs per loop
In[5]: %timeit x.view(’int64′) & y.view(’int64′)
100000 loops, best of 3: 12 µs per loop
# Checked that output is correct
In[6]: numpy.all( (x & y) == (x.view(’int64′) & y.view(’int64’)).view(’int8′) )
Out[6]: True


In this simple example we see 5x speed up. Views of course do not copy the data, which is very essential for the speed. This trick can be applied with: summation / subtraction / binary or / binary and, but you need that the size of original array was divisible by 8.