This notebook shows how gamba can be used to reproduce findings from Philander's 2014 study on data mining methods for detecting high-risk gamblers.
It uses data available through the transparency project above, and applies nine distinct supervised machine learning techniques.
Note: given the high dimensionality of this data (17 variables), the sample size (530) doesn't meet the rule of thumb that roughly 100 observations per dimension (here 100x17, or 1,700) are needed for learning to be generalisable. This means that the outputs of the methods below may change drastically between repeated executions, and comparison to the original may not be meaningful. With that caveat in mind, this notebook shows you how to do this kind of analysis using identical methods.
To begin, import gamba as usual;
import gamba as gb
measures_table = gb.data.prepare_philander_data('AnalyticDataSet_HighRisk.txt', loud=True)
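Before fitting anything, it can be useful to glance at the prepared table. This is just a quick sanity check, assuming (as the rest of the notebook does) that measures_table is a pandas DataFrame holding the behavioural measures plus the self_exclude label;

print(measures_table.shape)  # expect roughly 530 rows
print(measures_table['self_exclude'].value_counts())  # label balance between self-excluders and controls
measures_table.head()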
log_r, log_rp = gb.machine_learning.logistic_regression(measures_table, 'self_exclude', train_test_split=0.696)
lasso_l, lasso_lp = gb.machine_learning.lasso_logistic_regression(measures_table, 'self_exclude', train_test_split=0.696)
Neural Networks
The following cell uses the Keras library to create and train neural networks as described in the study. The original study uses the R nnet and caret packages; this Stack Overflow post was helpful in understanding the original parameters.
Framing the self_exclude label (0 or 1) as a regression problem means creating a neural network which returns a continuous value. The classification version of the neural network used in the original analysis has an identical network topology, but passes two strings as labels instead of a 1 or 0. In theory this should make no substantial difference to the network's performance (given the sample size and identical architectures).
By contrast, the gamba library's neural network methods have subtly different topologies for classification and regression, as described in Deep Learning with Python, and those are the versions used here.
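To make the distinction concrete, here is a minimal sketch of the two topologies in the style of Deep Learning with Python. It is illustrative only: the layer sizes, optimiser, and loss choices are assumptions, not necessarily the exact networks gamba builds internally;

from tensorflow.keras import models, layers

def build_regression_net(n_features):
    # regression: a single linear output unit, trained on mean squared error
    net = models.Sequential()
    net.add(layers.Dense(16, activation='relu', input_shape=(n_features,)))
    net.add(layers.Dense(16, activation='relu'))
    net.add(layers.Dense(1))
    net.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    return net

def build_classification_net(n_features):
    # classification: a single sigmoid output unit, trained on binary crossentropy
    net = models.Sequential()
    net.add(layers.Dense(16, activation='relu', input_shape=(n_features,)))
    net.add(layers.Dense(16, activation='relu'))
    net.add(layers.Dense(1, activation='sigmoid'))
    net.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
    return net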
nn_c, nn_cp = gb.machine_learning.neural_network_classification(measures_table, 'self_exclude', train_test_split=0.696)
nn_r, nn_rp = gb.machine_learning.neural_network_regression(measures_table, 'self_exclude', train_test_split=0.696)
Support Vector Machines (SVMs)
The following cell uses scikit-learn's SVM methods to create and train several SVMs. The original paper uses Dimitriadou et al.'s implementations in R, described here.
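For reference, the three variants below map onto scikit-learn's SVR (eps-regression), SVC (C-classification), and OneClassSVM (one-classification) estimators. The following is a minimal sketch of fitting them directly; the column handling, split, and default kernel settings are assumptions rather than what gamba necessarily does internally;

from sklearn import svm
from sklearn.model_selection import train_test_split

# build a feature matrix and label vector (drop any id columns your table contains)
X = measures_table.drop(['self_exclude'], axis=1).values
y = measures_table['self_exclude'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.696)

eps_svr = svm.SVR().fit(X_train, y_train)    # eps-regression
c_svc = svm.SVC().fit(X_train, y_train)      # C-classification
one_svm = svm.OneClassSVM().fit(X_train)     # one-classification, fits on features only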
svm_e, svm_ep = gb.machine_learning.svm_eps_regression(measures_table, 'self_exclude', train_test_split=0.696)
svm_c, svm_cp = gb.machine_learning.svm_c_classification(measures_table, 'self_exclude', train_test_split=0.696)
svm_o, svm_op = gb.machine_learning.svm_one_classification(measures_table, 'self_exclude', train_test_split=0.696)
Random Forest
This section uses scikit-learn's ensemble methods to create random forests for both classification and regression.
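These correspond to scikit-learn's RandomForestRegressor and RandomForestClassifier. A minimal sketch using the split from the SVM example above is shown here; the number of trees is an arbitrary illustrative choice;

from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

# reuses X_train, X_test, y_train, y_test from the SVM sketch above
forest_reg = RandomForestRegressor(n_estimators=100).fit(X_train, y_train)
forest_clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
print(forest_clf.score(X_test, y_test))  # mean accuracy on the held-out data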
rf_r, rf_rp = gb.machine_learning.rf_regression(measures_table, 'self_exclude', train_test_split=0.696)
rf_c, rf_cp = gb.machine_learning.rf_classification(measures_table, 'self_exclude', train_test_split=0.696)
Finally, let's present the performance of each of the machine learning techniques using a number of metrics. Not all of the metrics apply to all of the methods, but it's a good way to see roughly how they compare.
all_results = [
gb.machine_learning.compute_performance('Logistic Regression', log_r, log_rp),
gb.machine_learning.compute_performance('Lasso Logistic Regression', lasso_l, lasso_lp),
gb.machine_learning.compute_performance('NN Regression', nn_r, nn_rp),
gb.machine_learning.compute_performance('NN Classification', nn_c, nn_cp),
gb.machine_learning.compute_performance('SVM eps-Regression', svm_e, svm_ep),
gb.machine_learning.compute_performance('SVM c-Classification', svm_c, svm_cp),
gb.machine_learning.compute_performance('SVM one-Classification', svm_o, svm_op),
gb.machine_learning.compute_performance('RF Regression', rf_r, rf_rp),
gb.machine_learning.compute_performance('RF Classification', rf_c, rf_cp)
]
all_results_df = gb.concat(all_results)
display(all_results_df)
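Given the stability caveat at the top of this notebook, it can be worth saving each run's comparison table so repeated executions can be compared later. The filename here is just an example;

all_results_df.to_csv('philander_reproduction_results.csv', index=False)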