Testing implementations of LibFM

By LibFM I mean an approach to solve classification and regression problems. This approach is frequently used in recommendation systems, because it generalizes the matrix decompositions.

LibFM proved to be quite useful to deal with highly categorical data (user id / movie id / movie language / site id / advertisement id / etc.).

Implementations tested

  • Original and widely known implementation was written by Steffen Rendle (and available on github).
    • contains SGD, SGDA, ALS and MCMC optimizers
    • command-line interface, does not have official python / R wrapper
    • does not provide a way to save / load trained formula. Each time you want to predict something, you need to restart training process
    • has some extensions (that almost nobody uses)
    • supports linux, mac os
    • has non-oficial pythonic wrapper
  • FastFM (github repo)
    • claimed to be faster in the author's article
    • has both command-line interface and convenient python wrapper, which almost follows scikit-learn conventions.
    • supports SGD, ALS and MCMC optimizers
    • supports save / load (for the except of MCMC)
    • supports linux, mac os (though some issues with mac os)
  • pylibFM (github repo)
    • uses SGDA
    • pythonic library implemented with cython
    • save / load operates normally
    • supports any platform, provided cython operates normally
    • slow and requires additional tuning, the number of iterations is reduced for pylibFM in tests

None of the libraries are pip-installable and all libraries need some manual setup. FastFM is the only to install itself normally into site-packages.

What is tested

ALS (alternating least squares) is very useful optimization technique for factorization models, however there is still one parameter one has to pass - namely, regularization. Quality of classification / regression is quite sensible to this parameter, so for fast tests data analyst prefers to leave the question of selecting regularization to machine learning.

MCMC is usually proposed as a solution: optimization algorithm should "find" the optimal regularization. MCMC uses however some priors (which don't influence the result that much).

So I am testing the quality libraries provide without additional tuning to check how bayesian inference and other heuristics work.

Logistic regression

Logistic regression is used as a stable baseline, because it is basic method to work with highly categorical data.

However, logistic regression, for instance, does not encounter the relation between user variables and movie variables (in the context of movie recommendations), so this approach is not able to provide any senseful recommendations.

In [1]:
import numpy
import pandas
import load_problems
import cPickle as pickle
from sklearn.metrics import roc_auc_score, mean_squared_error
In [2]:
from fastFM.mcmc import FMClassification, FMRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.datasets import dump_svmlight_file

Defining functions for benchmarking

In [3]:
LIBFM_PATH = '/moosefs/ipython_env/python_libfm/bin/libFM'
PYLIBFM_PATH = '/moosefs/ipython_env/python_pylibFM/'

import sys
if PYLIBFM_PATH not in sys.path:
    sys.path.insert(0, PYLIBFM_PATH)
import pylibfm


def fitpredict_logistic(trainX, trainY, testX, classification=True, **params):
    encoder = OneHotEncoder(handle_unknown='ignore').fit(trainX)
    trainX = encoder.transform(trainX)
    testX = encoder.transform(testX)
    if classification:
        clf = LogisticRegression(**params)
        clf.fit(trainX, trainY)
        return clf.predict_proba(testX)[:, 1]
    else:
        clf = Ridge(**params)
        clf.fit(trainX, trainY)
        return clf.predict(testX)

def fitpredict_fastfm(trainX, trainY, testX, classification=True, rank=8, n_iter=100):
    encoder = OneHotEncoder(handle_unknown='ignore').fit(trainX)
    trainX = encoder.transform(trainX)
    testX = encoder.transform(testX)
    if classification:
        clf = FMClassification(rank=rank, n_iter=n_iter)
        return clf.fit_predict_proba(trainX, trainY, testX)
    else:
        clf = FMRegression(rank=rank, n_iter=n_iter)
        return clf.fit_predict(trainX, trainY, testX)  

def fitpredict_libfm(trainX, trainY, testX, classification=True, rank=8, n_iter=100):
    encoder = OneHotEncoder(handle_unknown='ignore').fit(trainX)
    trainX = encoder.transform(trainX)
    testX = encoder.transform(testX)
    train_file = 'libfm_train.txt'
    test_file = 'libfm_test.txt'
    with open(train_file, 'w') as f:
        dump_svmlight_file(trainX, trainY, f=f)
    with open(test_file, 'w') as f:
        dump_svmlight_file(testX, numpy.zeros(testX.shape[0]), f=f)
    task = 'c' if classification else 'r'
    console_output = !$LIBFM_PATH -task $task -method mcmc -train $train_file -test $test_file -iter $n_iter -dim '1,1,$rank' -out output.libfm
    
    libfm_pred = pandas.read_csv('output.libfm', header=None).values.flatten()
    return libfm_pred

def fitpredict_pylibfm(trainX, trainY, testX, classification=True, rank=8, n_iter=10):
    encoder = OneHotEncoder(handle_unknown='ignore').fit(trainX)
    trainX = encoder.transform(trainX)
    testX = encoder.transform(testX)
    task = 'classification' if classification else 'regression'
    fm = pylibfm.FM(num_factors=rank, num_iter=n_iter, verbose=False, task=task)
    if classification:
        fm.fit(trainX, trainY)
    else:
        fm.fit(trainX, trainY * 1.)
    return fm.predict(testX)

Executing all of the tests takes much time

Below is simple mechanism, which preserves results between runs.

In [4]:
from collections import OrderedDict
import time

all_results = OrderedDict()
try:
    with open('./saved_results.pkl') as f:
        all_results = pickle.load(f)
except:
    pass

def test_on_dataset(trainX, testX, trainY, testY, task_name, classification=True, use_pylibfm=True):
    algorithms = OrderedDict()
    algorithms['logistic'] = fitpredict_logistic
    algorithms['libFM']    = fitpredict_libfm
    algorithms['fastFM']   = fitpredict_fastfm
    if use_pylibfm:
        algorithms['pylibfm']  = fitpredict_pylibfm
    
    results = pandas.DataFrame()
    for name, fit_predict in algorithms.items():
        start = time.time()
        predictions = fit_predict(trainX, trainY, testX, classification=classification)
        spent_time = time.time() - start
        results.ix[name, 'time'] = spent_time
        if classification:
            results.ix[name, 'ROC AUC'] = roc_auc_score(testY, predictions)
        else:
            results.ix[name, 'RMSE'] = numpy.mean((testY - predictions) ** 2) ** 0.5
            
    all_results[task_name] = results
    with open('saved_results.pkl', 'w') as f:
        pickle.dump(all_results, f)
        
    return results

Testing on movielens-100k dataset, only ids

MovieLens dataset is famous dataset in recommender systems. The task is to predict ratings for movies

In [5]:
trainX, testX, trainY, testY = load_problems.load_problem_movielens_100k(all_features=False)
trainX.head()
Out[5]:
user movie
98980 810 900
69824 803 754
9928 51 286
75599 734 180
95621 896 95
In [6]:
test_on_dataset(trainX, testX, trainY, testY, task_name='ml100k, ids', classification=False)
Out[6]:
time RMSE
logistic 0.059469 0.942771
libFM 8.970990 0.913520
fastFM 4.840041 0.915184
pylibfm 13.157164 0.944870

Testing on movielens-100k dataset, with additional information

In [7]:
trainX, testX, trainY, testY = load_problems.load_problem_movielens_100k(all_features=True)
trainX.head()
Out[7]:
user movie age gender occupation zip released unknown Action Adventure ... Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
98980 692 1310 33 0 7 615 68 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
69824 931 528 48 1 3 59 57 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
9928 216 553 12 1 13 110 67 0 1 1 ... 0 0 0 0 0 0 0 0 0 0
75599 798 498 39 0 0 166 30 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
95621 910 547 27 0 20 397 68 0 0 0 ... 1 0 0 0 0 0 0 0 0 0

5 rows × 26 columns

In [8]:
test_on_dataset(trainX, testX, trainY, testY, task_name='ml100k', classification=False)
Out[8]:
time RMSE
logistic 1.869114 0.942377
libFM 49.632649 0.896349
fastFM 53.611804 0.896543
pylibfm 55.756278 NaN

Testing on movielens-1m dataset, only ids

In [14]:
trainX, testX, trainY, testY = load_problems.load_problem_movielens_1m(all_features=False)
trainX.head()
load_problems.py:73: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
  names=['user', 'movie', 'rating', 'timestamp'], header=None)
Out[14]:
user movie
610738 3703 3541
324752 1923 756
808217 4836 1288
133807 866 1106
431857 2630 2857
In [15]:
test_on_dataset(trainX, testX, trainY, testY, task_name='ml-1m,ids', classification=False)
Out[15]:
time RMSE
logistic 1.111601 0.910718
libFM 275.672684 0.861539
fastFM 307.400295 0.858305
pylibfm 132.618739 0.870263

Testing on movielens-1m dataset, with additional information

In [ ]:
trainX, testX, trainY, testY = load_problems.load_problem_movielens_1m(all_features=True)
trainX.head()
load_problems.py:78: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
  names=['user', 'gender', 'age', 'occupation', 'zip'], header=None)
load_problems.py:79: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
  movies = pandas.read_csv(folder + '/movies.dat', sep='::', names=['movie', 'title', 'genres'], header=None)
Out[ ]:
user movie gender age occupation zip 0 1 2 3 ... 10 11 12 13 14 15 16 17 18 19
610738 5245 2240 0 1 0 2168 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
324752 4260 2213 0 3 6 346 1 1 0 0 ... 0 0 0 0 0 1 0 0 0 0
808217 481 1205 1 2 14 1878 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
133807 973 1108 1 3 19 3140 1 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
431857 1344 2302 0 2 2 2336 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0

5 rows × 26 columns

In [34]:
test_on_dataset(trainX, testX, trainY, testY, task_name='ml-1m', classification=False)
Out[34]:
time RMSE
logistic 23.983249 0.911024
libFM 779.900802 0.850382
fastFM 1170.468130 0.852738
pylibfm 564.922632 NaN

Test on flights dataset - 1m

Flights dataset is quite famous due to these benchmarks by szilard.

Based on defferent charateristics the goal is to predict whether the flight was delayed by 15 minutes or more.

In [17]:
trainX, testX, trainY, testY = load_problems.load_problem_flight(large=False, convert_to_ints=False)
trainX.head()
Out[17]:
Month DayofMonth DayOfWeek DepTime UniqueCarrier Origin Dest Distance
0 c-4 c-26 c-2 1828 XE LEX IAH 828
1 c-12 c-11 c-1 1212 UA DEN MCI 533
2 c-10 c-1 c-6 935 OH HSV CVG 325
3 c-11 c-26 c-6 930 OH JFK PNS 1028
4 c-12 c-6 c-2 1350 MQ DFW LBB 282
In [24]:
trainX, testX, trainY, testY = load_problems.load_problem_flight(large=False, convert_to_ints=True)
trainX.head()
Out[24]:
Month DayofMonth DayOfWeek DepTime UniqueCarrier Origin Dest Distance
0 7 19 2 8 21 157 133 6
1 4 3 1 4 18 79 172 4
2 2 1 6 2 15 129 72 2
3 3 19 6 2 15 147 220 7
4 4 28 2 5 13 80 155 2
In [23]:
test_on_dataset(trainX, testX, trainY, testY, task_name='flight1m', classification=True)
Out[23]:
time ROC AUC
logistic 26.408147 0.715099
libFM 441.534214 0.720484
fastFM 667.197644 0.718840
pylibfm 316.149734 0.711200

Test on flights dataset - 10m

pylibFM drops the kernel, so doesn't participate in comparison

In [ ]:
trainX, testX, trainY, testY = load_problems.load_problem_flight(large=True, convert_to_ints=True)
trainX.head()
Out[ ]:
Month DayofMonth DayOfWeek DepTime UniqueCarrier Origin Dest Distance
0 8 3 4 0 15 37 29 1
1 6 16 3 8 1 149 157 9
2 3 13 7 0 13 29 37 1
3 11 27 6 4 14 80 198 7
4 7 19 2 1 12 216 155 9
In [11]:
test_on_dataset(trainX, testX, trainY, testY, task_name='flight10m', classification=True, use_pylibfm=False)
Out[11]:
time ROC AUC
logistic 307.085924 0.715754
libFM 8500.038059 0.724258
fastFM 10718.261802 0.721615

Flights dataset with additional features

We simply add some 'quadratic' features

In [63]:
trainX, testX, trainY, testY = load_problems.load_problem_flight_extended(large=False)
trainX.head()
Out[63]:
Month DayofMonth DayOfWeek DepTime UniqueCarrier Origin Dest Distance UniqueCarrier_Origin UniqueCarrier_Dest UniqueCarrier_DepTime Origin_Dest Origin_DepTime Dest_DepTime
0 7 19 2 19 21 157 133 7 1628 1622 496 2606 2612 2275
1 4 3 1 13 18 79 172 5 1342 1369 417 1267 1294 2921
2 2 1 6 10 15 129 72 3 1072 1050 344 1985 2117 1195
3 3 19 6 10 15 147 220 8 1083 1111 344 2363 2410 3807
4 4 28 2 14 13 80 155 3 806 836 299 1379 1317 2665
In [64]:
test_on_dataset(trainX, testX, trainY, testY, task_name='flight1m, ext', classification=True)
Out[64]:
time ROC AUC
logistic 130.553217 0.739113
libFM 1284.598730 0.760524
fastFM 2398.107206 0.748448
pylibfm 514.768003 0.743312

Test on Avazu dataset (100k)

Avazu dataset comes from kaggle challenge, goal is to predict Click-Through Rate.

All the variables given are categorical, LibFM gave good results in this challenge.

In [9]:
trainX, testX, trainY, testY = load_problems.load_problem_ad(train_size=100000)
# taking max hash of 1000 for each category
trainX = trainX % 1000
testX = testX % 1000
trainX.head()
Out[9]:
hour C1 banner_pos site_id site_domain site_category app_id app_domain app_category device_id ... device_conn_type C14 C15 C16 C17 C18 C19 C20 C21 day
13458308 1 2 0 582 339 2 884 254 0 800 ... 0 281 3 2 56 0 2 0 22 24
11331426 10 2 0 582 339 2 884 254 0 800 ... 1 283 3 2 56 0 2 0 22 23
1271792 6 2 0 879 903 24 884 254 0 800 ... 0 84 3 2 18 3 14 57 6 21
7618971 13 2 0 762 841 24 884 254 0 800 ... 0 88 3 2 220 0 2 132 42 22
16882663 3 2 1 158 893 24 884 254 0 800 ... 0 376 3 2 88 2 4 13 8 25

5 rows × 23 columns

In [36]:
test_on_dataset(trainX, testX, trainY, testY, 
                task_name='avazu100k', classification=True, use_pylibfm=False)
Out[36]:
time ROC AUC
logistic 40.417285 0.717096
libFM 7778.470057 0.730173
fastFM 6601.202590 0.699962

Avazu 1m

In [35]:
trainX, testX, trainY, testY = load_problems.load_problem_ad(train_size=1000000)
# taking max hash of 1000 for each category
trainX = trainX % 1000
testX = testX % 1000
test_on_dataset(trainX, testX, trainY, testY, 
                task_name='avazu1m', classification=True, use_pylibfm=False)
Out[35]:
time ROC AUC
logistic 228.501853 0.740987
libFM 9109.968575 0.748516
fastFM 7962.800504 0.733875

Results

composing all results in one table. RMSE should be minimal, ROC AUC - maximal.

In [69]:
results_table = pandas.DataFrame()
tuples = []

for name in ['ml100k, ids', 'ml-1m,ids', 'ml100k', 'ml-1m', 'flight1m', 'flight1m, ext', 'flight10m', 'avazu100k', 'avazu1m']:
    df = all_results[name]
    results_table[name + ' (time)'] = df['time']
    metric_name = df.columns[-1]
    results_table[name + metric_name] = df[metric_name]
    tuples.append([name, 'time'])
    tuples.append([name, df.columns[-1]])
    
results_table = results_table.T
results_table.index = pandas.MultiIndex.from_tuples(tuples, names=['dataset', 'value'])
results_table.T
Out[69]:
dataset ml100k, ids ml-1m,ids ml100k ml-1m flight1m flight1m, ext flight10m avazu100k avazu1m
value time RMSE time RMSE time RMSE time RMSE time ROC AUC time ROC AUC time ROC AUC time ROC AUC time ROC AUC
logistic 0.059469 0.942771 1.111601 0.910718 1.869114 0.942377 23.983249 0.911024 26.408147 0.715099 130.553217 0.739113 307.085924 0.715754 40.417285 0.717096 228.501853 0.740987
libFM 8.970990 0.913520 275.672684 0.861539 49.632649 0.896349 779.900802 0.850382 441.534214 0.720484 1284.598730 0.760524 8500.038059 0.724258 7778.470057 0.730173 9109.968575 0.748516
fastFM 4.840041 0.915184 307.400295 0.858305 53.611804 0.896543 1170.468130 0.852738 667.197644 0.718840 2398.107206 0.748448 10718.261802 0.721615 6601.202590 0.699962 7962.800504 0.733875
pylibfm 13.157164 0.944870 132.618739 0.870263 55.756278 NaN 564.922632 NaN 316.149734 0.711200 514.768003 0.743312 NaN NaN NaN NaN NaN NaN

Conclusion

  • pylibfm is out of the game. It is slow, it crashes on large datasets, sometimes simply diverge and hardly can compete in quality.
    Nothing new, adaptive methods require babysitting
  • FastFM and LibFM are quite stable and fast
  • but LibFM, being a bit faster, on average provides much better results.

As a sidenote, we saw on example with flight dataset that some feature engineering with providing quadratic features gives very significant boost in quality - even logisitic regression can work much better and faster than FMs on original features.

This post was written in IPython. You can download the notebook from repository.