Full replication

This notebook shows how gamba can be used to reproduce findings from Philander's 2014 study on data mining methods for detecting high-risk gamblers.

It uses data available through the Transparency Project described above, and applies nine distinct supervised machine learning techniques.

Note: given the high dimensionality of this data (17 features), the sample size (530) doesn't meet the rule of thumb that around 100 observations per feature (here 1,700) are required for learning to be generalisable. This means that the outputs of the methods below may change drastically between repeated executions, and comparison to the original may not be meaningful. With that in mind, this notebook shows you how to do this kind of analysis using identical methods.

To begin, import gamba as usual:

import gamba as gb
measures_table = gb.data.prepare_philander_data('AnalyticDataSet_HighRisk.txt', loud=True)
530 players loaded

Logistic Regressions

The machine learning module has wrappers for two logistic regression functions which can be used here. As with other machine learning methods in the gamba library, they return both the actual test labels and the predicted labels so that performance metrics can be computed.

log_r, log_rp = gb.machine_learning.logistic_regression(measures_table, 'self_exclude', train_test_split=0.696)
lasso_l, lasso_lp = gb.machine_learning.lasso_logistic_regression(measures_table, 'self_exclude', train_test_split=0.696)
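
For intuition, the cell below is a rough sketch of what a lasso logistic regression wrapper like this might do internally, assuming scikit-learn and a measures table containing a player_id column plus the label; it is an illustration only, not the gamba implementation, and simple_lasso_logistic_regression is a hypothetical name.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def simple_lasso_logistic_regression(measures, label_col, train_size=0.696):
    # separate the behavioural measures (features) from the label column
    X = measures.drop(columns=['player_id', label_col]).values
    y = measures[label_col].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=train_size)
    # an L1 (lasso) penalty needs a solver that supports it, e.g. liblinear
    model = LogisticRegression(penalty='l1', solver='liblinear')
    model.fit(X_train, y_train)
    # return the held-out labels alongside the predictions so metrics can be computed
    return y_test, model.predict(X_test)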

Neural Networks

The following cell uses the Keras library to create and train neural networks as described in the study. The original study uses the R nnet and caret packages; this Stack Overflow post was helpful in understanding the original parameters.

Framing the self_exclude label (0 or 1) as a regression problem means creating a neural network which returns a continuous label. The classification version of the neural network used in the original analysis uses an identical network topology but passes two strings as values instead of a 1 or 0. This should in theory make no substantial difference to the performance of the network (given the sample size and identical architectures).

By contrast, the gamba library's neural network methods have subtly different topologies for classification and regression as described in Deep Learning with Python, which are used here.

nn_c, nn_cp = gb.machine_learning.neural_network_classification(measures_table, 'self_exclude', train_test_split=0.696)
nn_r, nn_rp = gb.machine_learning.neural_network_regression(measures_table, 'self_exclude', train_test_split=0.696)
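
As a rough illustration of the topology difference described above (a sketch in the style of Deep Learning with Python, not the exact gamba architecture), the only changes between the two tasks are the output activation and the loss function:

from tensorflow.keras import layers, models

def build_network(n_features, task='classification'):
    # two small fully-connected hidden layers
    network = models.Sequential()
    network.add(layers.Dense(16, activation='relu', input_shape=(n_features,)))
    network.add(layers.Dense(16, activation='relu'))
    if task == 'classification':
        # sigmoid output with binary cross-entropy for the 0/1 self_exclude label
        network.add(layers.Dense(1, activation='sigmoid'))
        network.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
    else:
        # linear output with mean squared error when the label is treated as continuous
        network.add(layers.Dense(1))
        network.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    return network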

Support Vector Machines (SVMs)

The following cell uses scikit-learn's SVM methods to create and train some SVMs. The original paper uses Dimitriadou et al.'s implementations in R, described here.

svm_e, svm_ep = gb.machine_learning.svm_eps_regression(measures_table, 'self_exclude', train_test_split=0.696)
svm_c, svm_cp = gb.machine_learning.svm_c_classification(measures_table, 'self_exclude', train_test_split=0.696)
svm_o, svm_op = gb.machine_learning.svm_one_classification(measures_table, 'self_exclude', train_test_split=0.696)
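
The three variants map onto scikit-learn estimators roughly as follows (a sketch for orientation, not the gamba implementation or its exact parameters):

from sklearn.svm import SVR, SVC, OneClassSVM

# eps-regression treats the 0/1 label as a continuous target
svm_eps = SVR(kernel='rbf')
# C-classification is the standard two-class SVM
svm_c_clf = SVC(kernel='rbf')
# one-classification learns the shape of the data and flags outliers as -1
svm_one = OneClassSVM(kernel='rbf', nu=0.5)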

Random Forest

This section uses scikit-learn's ensemble methods to create random forests for classification and regression.

rf_r, rf_rp = gb.machine_learning.rf_regression(measures_table, 'self_exclude', train_test_split=0.696)
rf_c, rf_cp = gb.machine_learning.rf_classification(measures_table, 'self_exclude', train_test_split=0.696)
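
For reference, the corresponding scikit-learn estimators are along these lines (again a sketch, not necessarily the parameters gamba uses):

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# a regression forest averages each tree's continuous prediction
rf_reg_model = RandomForestRegressor(n_estimators=100)
# a classification forest takes a majority vote across the trees
rf_clf_model = RandomForestClassifier(n_estimators=100)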

All Methods Together

Finally, let's present the performance of each of the machine learning techniques using a number of metrics. Not all of the metrics apply to all of the methods, but it's a good way to see roughly how they compare.

all_results = [
    gb.machine_learning.compute_performance('Logistic Regression', log_r, log_rp),
    gb.machine_learning.compute_performance('Lasso Logistic Regression', lasso_l, lasso_lp),
    gb.machine_learning.compute_performance('NN Regression', nn_r, nn_rp),
    gb.machine_learning.compute_performance('NN Classification', nn_c, nn_cp),
    gb.machine_learning.compute_performance('SVM eps-Regression', svm_e, svm_ep),
    gb.machine_learning.compute_performance('SVM c-Classification', svm_c, svm_cp),
    gb.machine_learning.compute_performance('SVM one-Classification', svm_o, svm_op),
    gb.machine_learning.compute_performance('RF Regression', rf_r, rf_rp),
    gb.machine_learning.compute_performance('RF Classification', rf_c, rf_cp)
]

all_results_df = gb.concat(all_results)
display(all_results_df)
                           sensitivity  specificity  accuracy  precision    auc  odds_ratio
Logistic Regression              0.080        0.901     0.646      0.267  0.490       0.791
Lasso Logistic Regression        0.094        0.972     0.683      0.625  0.533       3.646
NN Regression                    0.431        0.545     0.509      0.306  0.488       0.910
NN Classification                0.000        1.000     0.640        NaN  0.500       0.000
SVM eps-Regression               0.020        0.991     0.683      0.500  0.505       2.180
SVM c-Classification             0.000        1.000     0.646        NaN  0.500       0.000
SVM one-Classification           0.462        0.505     0.491      0.308  0.483       0.873
RF Regression                    0.373        0.755     0.634      0.413  0.564       1.825
RF Classification                0.241        0.869     0.658      0.481  0.555       2.106
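
Metrics like those in the table above can be derived from the test and predicted labels returned by each method. The sketch below shows one way to do this with scikit-learn; it is not necessarily how gamba's compute_performance is implemented, and basic_performance is a hypothetical name.

import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def basic_performance(true_labels, predicted_labels):
    # round continuous predictions (e.g. from the regression methods) into 0/1 labels
    predictions = np.clip(np.round(predicted_labels), 0, 1)
    tn, fp, fn, tp = confusion_matrix(true_labels, predictions).ravel()
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else np.nan
    auc = roc_auc_score(true_labels, predictions)
    # diagnostic odds ratio: odds of a positive prediction being correct vs incorrect
    odds_ratio = (tp * tn) / (fp * fn) if fp > 0 and fn > 0 else np.nan
    return sensitivity, specificity, accuracy, precision, auc, odds_ratio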