Data manipulation with numpy: tips and tricks, part 2

More examples of fast data manipulation with numpy.

The first part may be found here.

In [1]:
from __future__ import print_function
import numpy as np

Rolling window, strided tricks

When working with time series or images, you frequently need to perform operations over sliding windows.

The simplest case: computing the mean over a running window:

In [2]:
sequence = np.random.normal(size=10000) + np.arange(10000)

A very bad idea is to do this in pure Python:

In [3]:
def running_average_simple(seq, window=100):
    result = np.zeros(len(seq) - window + 1)
    for i in range(len(result)):
        result[i] = np.mean(seq[i:i + window])
    return result

running_average_simple(sequence)
Out[3]:
array([   49.27774499,    50.25852659,    51.26151649, ...,  9947.32841776,
        9948.3391695 ,  9949.33539463])

A bit better option is to use as_strided:

In [4]:
from numpy.lib.stride_tricks import as_strided
def running_average_strides(seq, window=100):
    stride = seq.strides[0]
    sequence_strides = as_strided(seq, shape=[len(seq) - window + 1, window], strides=[stride, stride])
    return sequence_strides.mean(axis=1)
In [5]:
running_average_strides(sequence)
Out[5]:
array([   49.27774499,    50.25852659,    51.26151649, ...,  9947.32841776,
        9948.3391695 ,  9949.33539463])

From the computational side, as_strided does nothing: no copies and no computations. It only provides a new view of the data, which is two-dimensional this time.
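You can verify this by inspecting the strided view (a small check; may_share_memory confirms that nothing was copied):

window = 100
stride = sequence.strides[0]
view = as_strided(sequence, shape=[len(sequence) - window + 1, window], strides=[stride, stride])
print(view.shape, view.strides)             # two-dimensional view, both strides are equal
print(np.may_share_memory(view, sequence))  # True: same underlying buffer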

numpy.cumsum

However, the right way to compute the mean over a rolling window is with numpy.cumsum:

(this one is unbeatable in speed if the window is not small)

In [6]:
def running_average_cumsum(seq, window=100):
    s = np.insert(np.cumsum(seq), 0, [0])
    return (s[window :] - s[:-window]) * (1. / window)
In [7]:
running_average_cumsum(sequence)
Out[7]:
array([   49.27774499,    50.25852659,    51.26151649, ...,  9947.32841776,
        9948.3391695 ,  9949.33539463])

Remark: numpy.cumsum is equivalent to numpy.add.accumulate, but there are also:

  • numpy.maximum.accumulate, numpy.minimum.accumulate - running max and min
  • numpy.multiply.accumulate, which is equivalent to numpy.cumprod

Remark: for computing a rolling mean, numpy.cumsum is best; for other window statistics like min/max/percentile, use the strides trick.
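For illustration, here is a brief sketch of both ideas, reusing sequence and as_strided from above:

# cumulative ("running so far") maximum of the whole sequence
cumulative_max = np.maximum.accumulate(sequence)

# maximum over a rolling window of length 100 via the strides trick
stride = sequence.strides[0]
windows = as_strided(sequence, shape=[len(sequence) - 100 + 1, 100], strides=[stride, stride])
rolling_max = windows.max(axis=1)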

Strides and training on sequences

ML algorithms in Python typically take numpy arrays as input. In many cases, when working with sequences, you need to pass the same data several times as parts of different chunks.

Example: you have exchange rates for a year, and you want a GBDT to predict the next exchange rate based on the previous 10.

In [8]:
window = 10
rates = np.random.normal(size=1000)
# target in training
y = rates[window:]

Typically the solution used is:

In [9]:
X1 = np.zeros([len(rates) - window, window])
for day in range(len(X1)):
    X1[day, :] = rates[day:day + window]

But strided tricks are a better way, since they don't require additional space:

In [10]:
stride, = rates.strides
X2 = as_strided(rates, [len(rates) - window , window], strides=[stride, stride])
In [11]:
np.allclose(X1, X2)
Out[11]:
True

And if I want to pass not every window, but only every few (usually to speed up training), I use slicing, which again performs no operations on the data:

In [12]:
X_new = X2[::3]
y_new = y[::3]

Strides and images

The same tricks work with images. For instance, for each window of size [window_w, window_h] we want to compute the max, mean and variance. Let's compare the naive and strided approaches.

In [13]:
def compute_window_mean_and_var(image, window_w, window_h):
    w, h = image.shape
    w_new, h_new = w - window_w + 1, h - window_h + 1
    means = np.zeros([w_new, h_new])
    maximums = np.zeros([w_new, h_new])
    variations = np.zeros([w_new, h_new])
    for i in range(w_new):
        for j in range(h_new):
            window = image[i:i+window_w, j:j+window_h]
            means[i, j] = np.mean(window)
            maximums[i, j] = np.max(window) 
            variations[i, j] = np.var(window) 
    return means, maximums, variations    
In [14]:
def compute_window_mean_and_var_strided(image, window_w, window_h):
    w, h = image.shape
    strided_image = np.lib.stride_tricks.as_strided(image, 
                                                    shape=[w - window_w + 1, h - window_h + 1, window_w, window_h],
                                                    strides=image.strides + image.strides)
    # important: trying to reshape the strided image would create a complete 4-dimensional copy
    means = strided_image.mean(axis=(2,3)) 
    mean_squares = (strided_image ** 2).mean(axis=(2, 3)) 
    maximums = strided_image.max(axis=(2,3))
    
    variations = mean_squares - means ** 2
    return means, maximums, variations    
In [15]:
image = np.random.random([500, 500])
In [16]:
%time a1, b1, c1 = compute_window_mean_and_var(image, 20, 20)
CPU times: user 11.1 s, sys: 230 ms, total: 11.4 s
Wall time: 12.8 s
In [17]:
%time a2, b2, c2 = compute_window_mean_and_var_strided(image, 20, 20)
CPU times: user 626 ms, sys: 282 ms, total: 908 ms
Wall time: 928 ms
In [18]:
np.allclose(a1, a2), np.allclose(b1, b2), np.allclose(c1, c2)
Out[18]:
(True, True, True)

Remark: numpy.cumsum allows awesomely fast computations in this case as well (but not for the max). Find out how!

Oblivious decision trees

An oblivious decision tree (ODT) is a tree that uses the same split in all nodes at a given depth.

An ODT of depth $n$ consists of $n$ pairs (feature index, split) and $2^n$ real values, which correspond to the predictions in the leaves.

In [19]:
def predict_odt(data, features, splits, leaf_values):
    # first compute leaf for each sample in data
    leaf_indices = np.zeros(len(data), dtype=int)
    for feature, split in zip(features, splits):
        leaf_indices *= 2
        leaf_indices += data[:, feature] > split
        
    predictions = leaf_values[leaf_indices]
    return predictions

That's it. It was really simple; now let's create some fake data and a fake ODT to measure the speed.

In [20]:
data = np.random.normal(size=[100000, 11])

depth = 7
features = np.arange(depth)
splits = np.random.normal(size=depth).astype('float32')
leaf_values = np.random.random(2 ** depth).astype('float32')
In [21]:
%timeit predict_odt(data, features, splits, leaf_values)
100 loops, best of 3: 6.45 ms per loop

This is a good result, but can we do it faster? Let's apply bitwise operations instead of arithmetic ones.

In [22]:
def predict_odt_bit_operations(data, features, splits, leaf_values):
    leaf_indices = np.zeros(len(data), dtype='uint8')
    for feature, split in zip(features, splits):
        leaf_indices <<= 1
        leaf_indices |= data[:, feature] > split
        
    predictions = leaf_values[leaf_indices]
    return predictions
In [23]:
%timeit predict_odt_bit_operations(data, features, splits, leaf_values)
100 loops, best of 3: 5.43 ms per loop

Ok, but what if we apply bitwise operations to 64-bit integers? This should require 8 times fewer operations, because currently the operations are on 8-bit integers, which does not exploit the full width of 64-bit processors.

The result of the computation is the same, but this trick requires the length of the data to be divisible by 8.

This trick significantly speeds up the shift << and bitwise or | in the code, and the major part of the time (>80%) is now spent on evaluating the comparisons.

In [24]:
def predict_odt_fast(data, features, splits, leaf_values):
    leaf_indices = np.zeros(len(data), dtype='uint8').view('uint64')
    for feature, split in zip(features, splits):
        leaf_indices <<= 1
        leaf_indices |= (data[:, feature] > split).view('uint64')
    
    predictions = leaf_values[leaf_indices.view('uint8')]
    return predictions
In [25]:
%timeit predict_odt_fast(data, features, splits, leaf_values)
100 loops, best of 3: 4.42 ms per loop
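If the length of the data is not divisible by 8, one option (a hedged sketch of my own, not part of the original code) is to pad the comparison mask with False values and drop the padded tail at the end:

def predict_odt_fast_padded(data, features, splits, leaf_values):
    n = len(data)
    n_padded = ((n + 7) // 8) * 8                       # round the length up to a multiple of 8
    leaf_indices = np.zeros(n_padded, dtype='uint8').view('uint64')
    padded_mask = np.zeros(n_padded, dtype=bool)        # padded positions stay False
    for feature, split in zip(features, splits):
        padded_mask[:n] = data[:, feature] > split
        leaf_indices <<= 1
        leaf_indices |= padded_mask.view('uint64')
    return leaf_values[leaf_indices.view('uint8')[:n]]  # drop the padded tail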

Banks in RAM, strides and order of elements

Data in numpy is stored as a contiguous chunk of memory. This gives high flexibility (instant transpositions, reshaping, views and strided tricks), but sometimes we need to keep an eye on the strides (in particular, on the order of dimensions).

Consider the following simple example:

In [26]:
data = np.random.random([100000, 64]).astype(order='C', dtype='float32')
%timeit predict_odt_fast(data, features, splits, leaf_values)

data = np.random.random([100000, 64]).astype(order='F', dtype='float32')
%timeit predict_odt_fast(data, features, splits, leaf_values)
100 loops, best of 3: 6.89 ms per loop
1000 loops, best of 3: 754 µs per loop

How did we end up with such a huge difference in speed?

Operations on sequential data are faster thanks to the CPU cache. By default, numpy uses C-order, so the elements of a single column are far from each other in memory, while the elements within a row are placed together.

C-order is fine when operating on rows, while our ODT code uses column-wise operations.

The actual speedup depends strongly on the structure of your CPU cache, but in any case it is expected to be substantial.

This is especially pressing when the stride is a power of 2, because of memory banks, which are used to access memory in parallel. In C-order, the distance between the two closest elements of a column is:

In [27]:
np.random.random([10000, 64]).astype(order='C', dtype='float32').strides[0]
Out[27]:
256

The stride is 256 bytes, so each access will hit the same memory bank, which is the worst situation possible.

Conclusion: order of indexing matters. Mind the cache.
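If your data arrives in C-order, here is a minimal sketch of converting it once before doing column-wise work (np.asfortranarray copies only when the array is not already Fortran-ordered):

data_c = np.random.random([100000, 64]).astype('float32')   # C-order by default
data_f = np.asfortranarray(data_c)                           # one-time copy to column-major layout
print(data_c.flags['F_CONTIGUOUS'], data_f.flags['F_CONTIGUOUS'])   # False True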

Computing optimal step for AdaLoss function

In GBDT there is an important step after training a tree: finding the optimal values in the leaves.

AdaLoss (also known as ExpLoss) can be written as (here $y_i = \pm 1$):

$$ \sum_i \exp[-y_i d(x_i)]$$

Unlike other losses, for AdaLoss one can write an exact expression for the optimal value of a leaf:

$$ \text{leaf\_value} = \frac{1}{2} \log \left( \dfrac{w_{leaf,+}}{w_{leaf, -}} \right), $$

where $w_{leaf,+}, w_{leaf,-}$ are the total weights of signal and background events in the leaf (the expression follows from setting the derivative of the leaf's contribution $w_{leaf,+} e^{-v} + w_{leaf,-} e^{v}$ with respect to the leaf value $v$ to zero). Let's write this procedure in beautiful numpy code:

In [28]:
def compute_optimal_leaf_values(terminal_regions, labels, predictions):
    """
    terminal_regions are integers, corresponding to the index of the leaf each event belongs to
    labels are +1 and -1
    predictions are real numbers - sum of predictions given by previous trees in boosting
    """
    weights = np.exp(- labels * predictions)
    w_plus = np.bincount(terminal_regions, weights * (labels == +1))
    w_minus = np.bincount(terminal_regions, weights * (labels == -1))
    return np.log(w_plus / w_minus) / 2.

Logistic function

The logistic function is frequently used in machine learning to convert a real-valued output into a probability: $$ \sigma (x) = \dfrac{e^x}{e^x + 1} = \dfrac{1}{1 + e^{-x}} $$

However, it is a bad idea to use this definition directly, because:

In [29]:
def logistic(x):
    return np.exp(x) / (1 + np.exp(x))
In [30]:
logistic(1000)
/Users/axelr/.venvs/rep/lib/python2.7/site-packages/ipykernel/__main__.py:2: RuntimeWarning: overflow encountered in exp
  from ipykernel import kernelapp as app
/Users/axelr/.venvs/rep/lib/python2.7/site-packages/ipykernel/__main__.py:2: RuntimeWarning: invalid value encountered in double_scalars
  from ipykernel import kernelapp as app
Out[30]:
nan

The first expression fails. The second expression raises a warning, but returns the correct answer (earlier versions of numpy returned NaN in this case as well).
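For comparison, here is a minimal sketch of the second form: for a large negative argument, np.exp(-x) overflows and numpy warns, yet the final ratio is still the correct 0.0.

def logistic_second_form(x):
    # 1 / (1 + e^{-x}); for x = -1000, np.exp(1000) overflows to inf with a RuntimeWarning,
    # but 1 / (1 + inf) still evaluates to the correct 0.0
    return 1. / (1. + np.exp(-x))

logistic_second_form(-1000.)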

Instead use scipy.special.expit, which is fast and reliable:

In [31]:
from scipy.special import expit
In [32]:
expit([-1000, 0, 1000])
Out[32]:
array([ 0. ,  0.5,  1. ])

Logistic loss function

Expressed via the decision function, the logistic loss for a single event is:

$$ LogLoss = \log(1 + e^{-yd}) $$

For simplicity, let's implement the function $$\log(1 + e^x)$$

In [33]:
def logloss(x):
    return np.log(1 + np.exp(x))
In [34]:
logloss([1, 10, 100, 1000])
/Users/axelr/.venvs/rep/lib/python2.7/site-packages/ipykernel/__main__.py:2: RuntimeWarning: overflow encountered in exp
  from ipykernel import kernelapp as app
Out[34]:
array([   1.31326169,   10.0000454 ,  100.        ,           inf])

Whoops, something went wrong. Of course, there was an overflow after applying the exponent: the resulting number didn't fit into float64.

Use numpy.logaddexp to avoid these problems!

In [35]:
np.logaddexp(0, [1, 10, 100, 1000])
Out[35]:
array([    1.31326169,    10.0000454 ,   100.        ,  1000.        ])

Remark: numpy.logaddexp.at is a good replacement for numpy.maximum.at if you want a 'soft' maximum.
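A tiny sketch of such a 'soft maximum' accumulated per group (the indices and values are a toy example of my own):

soft_max = np.full(3, -np.inf)     # -inf is the identity element for logaddexp
np.logaddexp.at(soft_max, [0, 0, 1, 2, 2], [1.0, 2.0, 0.5, 3.0, 3.5])
print(soft_max)                    # log(e^1 + e^2), 0.5, log(e^3 + e^3.5)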

Masked array

numpy doesn't support jagged arrays, but this can be partially worked around with masked arrays.

Imagine the following situation: there are $n$ houses and we want to select a location to build a new one. For each candidate position we analyze the prices of the $k$ neighbouring houses to guess the best place.

In [36]:
houses_prices = np.random.exponential(size=10000)
In [37]:
n_neighs = 20
position_neighbours = np.random.randint(0, len(houses_prices), size=[100000, n_neighs])

Computing the average price of the nearest houses is fairly simple:

In [38]:
houses_prices[position_neighbours].mean(axis=1)
Out[38]:
array([ 0.85558849,  0.85929689,  1.18028179, ...,  1.02167602,
        0.96623028,  1.04202728])

The same method can be used to get the total/min/max/std price of the closest houses.

The problem is computing these statistics when part of the data should be ignored. For instance, houses which are very expensive (the top 5%) are probably expensive for reasons other than their position. Can we ignore them when computing the average?

Masked arrays come to the rescue:

In [39]:
# ignore houses with higher price
max_price = np.percentile(houses_prices, 95)

data = houses_prices[position_neighbours]
ignored = houses_prices[position_neighbours] > max_price

# pay attention that data[~ignored] will return one-dimensional array,
# that's why we need masked arrays
masked_prices = np.ma.MaskedArray(data=data, mask=ignored)

masked_prices.mean(axis=1)
Out[39]:
masked_array(data = [0.6948163814376606 0.8592968871648994 1.0309823904276911 ...,
 0.8899183280916126 0.8485975161178317 1.0420272772084531],
             mask = [False False False ..., False False False],
       fill_value = 1e+20)
In [40]:
np.array(masked_prices.mean(axis=1))
Out[40]:
array([ 0.69481638,  0.85929689,  1.03098239, ...,  0.88991833,
        0.84859752,  1.04202728])

numpy.ufunc.reduceat to work with multicategories

A multicategory is a type of feature where each sample in the dataset can belong to several (possibly zero) groups.

For instance, the samples are people, and we have information about which languages they program in. The set of programming languages is a multicategory.

Assume we have two parallel arrays, where each corresponding pair denotes a programmer id and the id of a language he/she knows.

In [41]:
salaries = np.random.random(2000)
programers = np.sort(np.random.randint(0, len(salaries), size=10000))
languages = np.random.randint(0, 50, size=len(programers))

print(programers)
print(languages)
[   0    0    0 ..., 1999 1999 1999]
[15  8  3 ..., 11 11 28]

Let's compute, for each language, the average salary of its users:

In [42]:
lang_average_salaries = np.bincount(languages, salaries[programers]) / np.bincount(languages)

For fun, we can now compute the salary a programmer could get for the best-paid of his languages:

In [43]:
programmer_top_salary = np.zeros(programers.max() + 1)
np.maximum.at(programmer_top_salary, programers, lang_average_salaries[languages])

print(programmer_top_salary)
[ 0.50898484  0.48684316  0.51820748 ...,  0.54239297  0.54239297
  0.51820748]

Get the id of the best-paid language

We could guess the language from the salary value, but this may be unreliable in other cases.

Let's combine tricks studied earlier:

In [44]:
lang_sorter = np.argsort(lang_average_salaries)
# rank of each language when ordered by average salary (0 = worst paid):
lang_salaries_ordered = np.argsort(lang_sorter)

programmer_top_language = np.zeros(programers.max() + 1, dtype=int)
np.maximum.at(programmer_top_language, programers, lang_salaries_ordered[languages])

# now we need to decode the rank back to the original language id
programmer_top_language = lang_sorter[programmer_top_language]
In [45]:
programmer_top_language
Out[45]:
array([ 3,  2,  7, ..., 27, 27,  7])

Checking against the previous result: taking the average salaries for the top languages.

NB: there were programmers with no languages; their 'top language' became the worst-paid one. It may be worth creating a special pseudo-language to denote such situations.

In [46]:
lang_average_salaries[programmer_top_language]
Out[46]:
array([ 0.50898484,  0.48684316,  0.51820748, ...,  0.54239297,
        0.54239297,  0.51820748])
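Following up on the NB above, here is a small sketch (my own variation, not from the original code) of using -1 as a pseudo-language marker for programmers with no languages:

top_language_rank = np.full(programers.max() + 1, -1, dtype=int)
np.maximum.at(top_language_rank, programers, lang_salaries_ordered[languages])
no_language = top_language_rank == -1   # True for ids that never appear in `programers`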

numpy.ufunc.reduceat

numpy.maximum.at is not fast at the moment, though it is very convenient.

Let's show a faster, but more complicated, way using numpy.maximum.reduceat:

In [47]:
sorter = np.argsort(programers)
ordered_programers = programers[sorter]
ordered_languages = languages[sorter]
In [48]:
limits_in_programmers = np.insert(np.bincount(programers).cumsum()[:-1], 0, [0])
programmer_top_salary = np.maximum.reduceat(lang_average_salaries[ordered_languages], limits_in_programmers) 

Remark:

  • numpy.ufunc.reduceat has non-obvious behavior when a sample's list of categories is empty: in this case the value for the first category of the next sample is returned (see the toy example after this list).
  • a significant degradation in speed happens when samples have too few categories; with one category per sample, the speed is comparable to numpy.ufunc.at.
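A toy illustration of the first point (values and indices are my own): with indices [0, 3, 3, 5] the second group is empty, and reduceat returns values[3], the first element of the next group.

values = np.array([1., 5., 2., 9., 4., 7.])
print(np.maximum.reduceat(values, [0, 3, 3, 5]))   # [ 5.  9.  9.  7.]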

Sparse arrays and multicategories

There is an alternative to numpy.bincount, which is especially handy when using weights.

A minor modification of the previous example: each programmer has a skill value, describing how well he knows a programming language (it plays the role of a weight).

In [49]:
skills = np.random.random(len(languages)) 

A sparse matrix can hold all of this information together:

In [50]:
from scipy.sparse import coo_matrix
matrix = coo_matrix((skills, (programers, languages)))

Easy normalization tricks are a strong side of sparse matrices.

Let's compute the average salary with weights proportional to 'skill':

In [51]:
matrix_normed_for_language = matrix / (matrix.sum(axis=0) + 0.01)
lang_average_salaries = matrix_normed_for_language.T.dot(salaries)
In [52]:
matrix.dot(lang_average_salaries.T)
Out[52]:
matrix([[ 1.98991879],
        [ 0.30377964],
        [ 1.22430908],
        ..., 
        [ 1.42907617],
        [ 1.78078234],
        [ 1.80980639]])

Sparse matrices are slower than bincount if used only once, because of the cost of matrix creation, but faster when applied many times.
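If the same matrix is reused, it may be worth converting it once to CSR format, which is faster for repeated products (a small sketch):

matrix_csr = matrix.tocsr()                              # one-time conversion cost
predictions = matrix_csr.dot(lang_average_salaries.T)    # repeated products are now cheaper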

Fast computing of efficiencies for group

In the uBoost algorithm we need to compute local efficiencies many times: for each event, which fraction of its neighbours passes a given threshold. A sparse matrix is a good option in this case. Similar things happen in other classifiers (uGB+kNN, RankBoost).

As an exercise, let's compute which fraction of programmers knowing language X have a salary greater than salary_threshold (printing the result only for the first 5 languages):

In [53]:
for salary_threshold in np.linspace(0, 1, 10):
    print(matrix_normed_for_language.T.dot(salaries > salary_threshold)[0, :5])
[[ 0.99989806  0.99991087  0.99990149  0.99988927  0.99988925]]
[[ 0.84027866  0.85943743  0.87756901  0.86609172  0.899259  ]]
[[ 0.71429575  0.77817814  0.74757531  0.75081062  0.77426352]]
[[ 0.61629482  0.6186105   0.61793111  0.63928084  0.65908296]]
[[ 0.52584106  0.50824714  0.50906984  0.55588072  0.53273126]]
[[ 0.41138264  0.35382012  0.44019973  0.44312564  0.43501221]]
[[ 0.29691138  0.24084839  0.31622935  0.37518756  0.33608273]]
[[ 0.21323651  0.18827297  0.2335139   0.28325233  0.22044709]]
[[ 0.1371795   0.06998397  0.1376106   0.14013846  0.09387452]]
[[ 0.  0.  0.  0.  0.]]

Fast remainder

When dealing with words or other categorical variables, one-hot encoding frequently causes trouble (you don't want to keep the mapping around, or the model ends up with too many parameters).

The hashing trick is a good alternative: each value is mapped by a hash to a small number.
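A tiny sketch of the hashing trick itself (the string categories are a hypothetical example of my own): map each value into a small fixed range of buckets.

words = ['python', 'numpy', 'scipy', 'numpy', 'pandas']
n_buckets = 128
hashed_categories = np.array([hash(w) % n_buckets for w in words])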

A funny fact is that the integer remainder is not a very cheap operation, even though it has to be applied many times:

In [54]:
categories = np.random.randint(0, 1e10, size=1000000)
In [55]:
%timeit categories % 128
100 loops, best of 3: 13.7 ms per loop

If you select the number of hash buckets to be a power of 2, you can do the same with a bitmask:

In [56]:
%timeit categories & 127
100 loops, best of 3: 3.01 ms per loop

How about rocket speed? (Please don't apply this in practice; it is just to show that this is not the limit.)

In [57]:
%%timeit
# build the mask 0x7f7f...7f: 127 repeated in every byte of the int64
mask = np.array(127, dtype='int64')
for _ in range(8):
    mask |= mask << 8
# take the lowest byte of every int64 (assuming little-endian layout),
# pack 8 of them into a single uint64 and mask all 8 at once
result = (categories.view('uint8')[::8].copy().view('uint64') & mask).view('uint8')
1000 loops, best of 3: 1.02 ms per loop

Please keep such simple bit tricks in mind if you're going to implement some algorithm.

Online statistics

Finally, let's implement an online statistics computation. In this case we may only use information about the past.

For instance, for each category we can compute how many times it occurred before. This is called an online counter.

Online counter

In [58]:
# plain python approach
from collections import defaultdict

def compute_online_counter(categories_stream):
    online_counter = np.zeros_like(categories_stream)
    counters = defaultdict(int)
    for i in range(len(categories_stream)):
        online_counter[i] = counters[categories_stream[i]]
        counters[categories_stream[i]] += 1
    return online_counter

Simple example for the online counter

At each position we compute how many times the number (representing a category) appeared earlier in the stream:

In [59]:
compute_online_counter([1, 2, 1, 1, 3, 2, 5, 6, 3, 3, 3])
Out[59]:
array([0, 0, 1, 2, 0, 1, 0, 0, 1, 2, 3])
In [60]:
categories_stream = np.random.randint(0, 10000, size=1000000)
In [61]:
%timeit online_counter = compute_online_counter(categories_stream)
1 loop, best of 3: 609 ms per loop
In [62]:
def compute_online_counter_numpy(categories_stream):
    # total number of occurrences of each category
    occurrences = np.bincount(categories_stream)
    # for each category: position where its block starts in a stable sort of the stream
    before = occurrences.cumsum() - occurrences
    # rank of each element under a stable sort = start of its category block
    # + number of earlier occurrences of the same category
    positions = categories_stream.argsort(kind='mergesort').argsort(kind='mergesort')
    return positions - before[categories_stream]
In [63]:
%timeit compute_online_counter_numpy(categories_stream)
1 loop, best of 3: 264 ms per loop

This is a situation where numpy is really not of much help: applying sorting twice is a poor approach for a problem that shouldn't require sorting at all.

pandas offers a nice one-line alternative, but it is... hmm... a bit slow.

In [64]:
import pandas
df = pandas.DataFrame({'cat':  categories_stream, 'ones': np.ones(len(categories_stream))})
# truncating to first 10000, otherwise it will never finish
df = df[:10000]
%time result = df.groupby('cat')['ones'].transform(pandas.Series.cumsum)
CPU times: user 956 ms, sys: 5.42 ms, total: 962 ms
Wall time: 965 ms

Ok, then it is time to use some external tool. Let's do it with parakeet.

In [65]:
import parakeet
parakeet.config.backend = 'c' # one thread

@parakeet.jit
def compute_online_counter_parakeet(categories_stream):
    counters = np.zeros(np.max(categories_stream) + 1)
    online_counter = np.zeros_like(categories_stream)
    for i in range(len(categories_stream)):
        online_counter[i] = counters[categories_stream[i]]
        counters[categories_stream[i]] += 1
    return online_counter
In [66]:
compute_online_counter_parakeet(categories_stream)
%timeit compute_online_counter_parakeet(categories_stream)
100 loops, best of 3: 9.7 ms per loop

However, keep in mind that you will probably not be able to use this code on Windows. Use external tools only when they are really needed.

Conclusion

  • numpy is very powerful if properly used.
  • learning numpy is harder than a + b, but it pays off in brevity and readability.
  • there are very few cases when you really need to switch from python to cython or C during research.
  • many of the things above can be implemented in other vector environments: theano / matlab / R.

Links

There are many more things to know about numpy. Here are some helpful links you may be interested in:

  1. Getting the Best Performance out of NumPy
  2. Game of life and numpy reference
  3. 100 numpy exercises
  4. Advanced numpy
  5. Tentative numpy tutorial
  6. IPython cookbook

This post was written in IPython. You can download the notebook from repository.