Apply simple machine learning algorithms to a measures table


Overview

The gamba library contains several wrapper methods for commonly applied machine learning methods. Whilst it would be impossible to cover every machine learning method used in existing literature, the gamba library aims to provide a number of examples so that if you have a particular method in mind, some variant or similar implementation is available for you to extend.

Methods

Clustering

Clustering techniques were popular in early literature and continue to be applied across a number of problems. The gamba library contains wrapper methods around the scikit-learn library, which implements a number of clustering algorithms.

k_means

k_means(measures_table, clusters=4, data_only=False, plot=False, loud=False)

Applies the k-means clustering algorithm to a measures table. The resulting clustering is then reported in terms of its inertia (sum of squared distances of samples to their closest cluster center) and its silhouette score (how distinct clusters are within the sample [see the scikit-learn docs for details]). Using the data_only parameter, the measures table passed as the first parameter can be returned with an added column reporting the cluster each player belongs to.
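To make the two reported quantities concrete, here is a minimal sketch that computes them with scikit-learn directly. It is not gamba's implementation: the synthetic table and its column names (player_id, frequency, mean_bet) are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for a measures table (column names are hypothetical).
rng = np.random.default_rng(0)
low = rng.normal(loc=[1.0, 5.0], scale=0.3, size=(50, 2))
high = rng.normal(loc=[6.0, 20.0], scale=0.3, size=(50, 2))
measures = pd.DataFrame(np.vstack([low, high]), columns=["frequency", "mean_bet"])
measures.insert(0, "player_id", range(len(measures)))

# Cluster on the measure columns only, not on the identifier.
features = measures.drop(columns=["player_id"])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# Inertia: sum of squared distances of samples to their closest centre.
inertia = model.inertia_
# Silhouette: between -1 and 1, higher means more distinct clusters.
silhouette = silhouette_score(features, model.labels_)

# The measures table with an added column reporting each player's cluster.
clustered = measures.assign(cluster=model.labels_)
```

Dropping the identifier column before fitting matters here, since k-means would otherwise treat the player ID as a behavioural measure.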

k_means_range

k_means_range(measures_table, min_clusters=2, max_clusters=13)

Computes the k_means calculation above across a range of cluster counts, returning their goodness of fit measures (inertia and silhouette).
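The idea of scanning a range of cluster counts can be sketched as a simple loop over scikit-learn's KMeans. The helper name kmeans_fit_table is hypothetical, and treating max_clusters as inclusive is an assumption of this sketch rather than a statement about gamba's behaviour.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def kmeans_fit_table(features, min_clusters=2, max_clusters=6):
    """Fit k-means for each k in [min_clusters, max_clusters] and tabulate the fit."""
    rows = []
    for k in range(min_clusters, max_clusters + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
        rows.append({
            "clusters": k,
            "inertia": model.inertia_,
            "silhouette": silhouette_score(features, model.labels_),
        })
    return pd.DataFrame(rows)


# Three well-separated synthetic groups stand in for real measures.
features, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [10, 0], [0, 10]], cluster_std=0.5, random_state=0
)
fit_table = kmeans_fit_table(features)

# One common heuristic: pick the cluster count with the highest silhouette.
best_k = int(fit_table.loc[fit_table["silhouette"].idxmax(), "clusters"])
```

Inertia alone tends to fall as clusters are added, so it is usually read alongside the silhouette score rather than minimised on its own.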

k_means_ensemble

k_means_ensemble(measures_table, ensemble_size=100, min_clusters=2, max_clusters=13)

Computes the k_means clustering algorithm across a range of cluster counts, a number of times. This is useful for determining clusters in a more robust way but can be slow on large data sets.

agglomerative_cluster

agglomerative_cluster(measures_table, distance_threshold=0, n_clusters=None)

Performs sklearn's agglomerative clustering algorithm on a dataframe of behavioural measures. See the scikit-learn documentation for details. Note: exactly one of the distance_threshold and n_clusters parameters must be None.
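The constraint on the two parameters comes from scikit-learn itself, and can be seen by calling AgglomerativeClustering directly. This is a sketch on synthetic data, not gamba's wrapper.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic stand-in for the numeric columns of a measures table.
features, _ = make_blobs(
    n_samples=30, centers=[[0, 0], [10, 0], [0, 10]], cluster_std=0.5, random_state=0
)

# Mode 1: a distance threshold of 0 computes the full merge tree but accepts
# no merges, so every sample is reported as its own cluster.
full_tree = AgglomerativeClustering(distance_threshold=0, n_clusters=None).fit(features)

# Mode 2: request a fixed number of clusters instead.
three = AgglomerativeClustering(n_clusters=3, distance_threshold=None).fit(features)

# Setting both parameters at once is rejected by scikit-learn.
try:
    AgglomerativeClustering(n_clusters=3, distance_threshold=0).fit(features)
    both_rejected = False
except ValueError:
    both_rejected = True
```

The first mode is the one typically used when a dendrogram is wanted, because the fitted model then exposes the merge distances in its distances_ attribute.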

describe_clusters

describe_clusters(clustered_measures_table, cluster_col='cluster')

Describes cluster centroids (mean values of each measure) for each cluster in a clustered measures table.
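A centroid in this sense is just a per-cluster mean, which pandas can compute with a group-by. The sketch below uses a tiny hand-made table with hypothetical measure names, and is not gamba's implementation.

```python
import pandas as pd

# A tiny clustered measures table (measure names are hypothetical).
clustered = pd.DataFrame({
    "frequency": [1.0, 3.0, 10.0, 14.0],
    "mean_bet": [2.0, 4.0, 20.0, 40.0],
    "cluster": [0, 0, 1, 1],
})

# Mean of each measure within each cluster: one row per cluster.
centroids = clustered.groupby("cluster").mean()

# Cluster sizes are often useful to report alongside the centroids.
sizes = clustered.groupby("cluster").size()
```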

Neural Networks

Neural networks are a powerful way to learn high dimensional patterns in transaction data sets. The gamba library contains limited neural network capabilities, but provides fully annotated methods so you can copy them into your own workflows and extend them as needed.

neural_network_classification

neural_network_classification(measures_table, label, train_test_split=0.7)

A basic example of a classification network using a single sigmoid-activated output node. The output is the probability that the player belongs to the class indicated by the label. As with neural_network_regression, this network closely follows examples in Deep Learning with Python, which is a great resource for extending this example.

neural_network_regression

neural_network_regression(measures_table, label, train_test_split=0.7)

A very simple neural network for regression using two hidden layers and one output node with a linear activation function. This method is very similar to the example in Deep Learning with Python, and can be used as a template for more complex network topologies.

Logistic Regression

Logistic regression techniques can map simple relationships between variables, but typically become untenable in more complex applications. As with the clustering techniques, the gamba library contains wrappers around the sklearn library's logistic regression functions, offering simple methods to run analyses without having to interact with the sklearn library directly (although you should for advanced techniques!).

logistic_regression

logistic_regression(measures_table, label, train_test_split=0.7)

Performs a logistic regression using the statsmodels library, returning the predicted labels rounded to the nearest integer. Note: this method is currently hard-coded to only function on Philander 2014's data set. Abstracted logistic regression function coming soon.

lasso_logistic_regression

lasso_logistic_regression(measures_table, label, train_test_split=0.7)

Performs a 'lasso' logistic regression (a logistic regression fitted with an L1 penalty, by analogy with the lasso's L1-penalised least-squares problem) using sklearn's linear_model. This Stack Overflow post contains a useful discussion on this type of function-estimation regression pair.
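For readers who want to see what an L1 penalty does in practice, the following sketch fits scikit-learn's LogisticRegression with penalty="l1" on synthetic data. It is an illustration rather than gamba's code: the data, the value of C, and the reading of train_test_split=0.7 as "the first 70% of rows are used for training" are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
# Only the first two columns carry signal; the other eight are pure noise.
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=400) > 0).astype(int)

# Assumed interpretation of train_test_split=0.7: a 70/30 row split.
split = int(0.7 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# The liblinear solver supports the L1 penalty; a smaller C penalises harder.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
# Indices of the measures whose coefficients survived the penalty.
kept = np.flatnonzero(model.coef_[0])
```

The appeal of the L1 penalty is that it can shrink uninformative coefficients to exactly zero, so the fitted model doubles as a rough measure-selection step.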

Support Vector Machines

SVMs are another type of machine learning algorithm. They have seen limited use in existing work, but are included as templates to show how they could be used as part of an analysis using the gamba library. The three SVM methods implemented include those used in Philander's 2014 study.

svm_eps_regression

svm_eps_regression(measures_table, label, train_test_split=0.7)

Creates and trains a support vector machine for epsilon-support vector regression using the sklearn library's implementation.

svm_c_classification

svm_c_classification(measures_table, label, train_test_split=0.7)

Creates and trains a support vector machine for classification using the sklearn library's implementation.

svm_one_classification

svm_one_classification(measures_table, label, train_test_split=0.7)

Creates and trains a support vector machine for one-class classification using the sklearn library's implementation.
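The three SVM variants above correspond to three scikit-learn estimators: SVR, SVC, and OneClassSVM. The sketch below shows them side by side on synthetic data. The hyperparameter values and the 70/30 row split are assumptions for illustration, and gamba's wrappers may configure the estimators differently.

```python
import numpy as np
from sklearn.svm import SVC, SVR, OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
continuous_label = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)
binary_label = (X[:, 0] > 0).astype(int)

# Assumed 70/30 row split between training and test data.
split = int(0.7 * len(X))
X_train, X_test = X[:split], X[split:]

# Epsilon-support vector regression: predicts a continuous value.
svr = SVR(epsilon=0.1).fit(X_train, continuous_label[:split])
regression_pred = svr.predict(X_test)

# C-support vector classification: predicts a class label.
svc = SVC(C=1.0).fit(X_train, binary_label[:split])
class_pred = svc.predict(X_test)
class_accuracy = (class_pred == binary_label[split:]).mean()

# One-class SVM: fitted on features only; predicts +1 (inlier) or -1 (outlier).
ocsvm = OneClassSVM(nu=0.1).fit(X_train)
inlier_pred = ocsvm.predict(X_test)
```

Note that the one-class variant is unsupervised in scikit-learn, so it never sees a label during fitting.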

Random Forests

Random forests are collections of decision trees which are applied en masse to classification and regression problems. The gamba library implements both techniques in a very basic form, following a similar structure to other machine learning methods in the library. As with many other methods on this page, the gamba library simply contains wrappers for the scikit-learn library, which should be used directly for more advanced analyses!

rf_regression

rf_regression(measures_table, label, train_test_split=0.7)

Creates and fits a random forest regressor using sklearn's ensemble module, returning the predicted labels rounded to the nearest integer.
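Rounding a regressor's output is what turns continuous predictions into labels. A minimal sketch with scikit-learn's RandomForestRegressor follows; the synthetic data, the 70/30 row split, and the use of np.rint for rounding are assumptions of this example rather than details of gamba's implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)  # a 0/1 label driven by the first measure

# Assumed 70/30 row split between training and test data.
split = int(0.7 * len(X))
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X[:split], y[:split])

# The regressor averages its trees, so raw predictions lie between 0 and 1.
raw = forest.predict(X[split:])

# Round to the nearest integer to recover a predicted label.
rounded = np.rint(raw).astype(int)
accuracy = (rounded == y[split:]).mean()
```

Be aware that np.rint rounds exact halves to the nearest even integer, which may or may not match the rounding rule you want for borderline predictions.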

rf_classification

rf_classification(measures_table, label, train_test_split=0.7)

Creates a random forest for classification, also using the sklearn library's implementation.

Performance

Performance measures of many machine learning methods can be derived from the values they predict and the actual values of a test dataset. The compute_performance function below is useful for quick estimates of performance, but you should refer to the measures appropriate to the specific method you're implementing for advanced analyses.

compute_performance

compute_performance(method_name, actual, predicted)

Computes performance metrics including sensitivity, specificity, accuracy, confusion matrix values, odds ratio, and area under curve, for a given classification/regression using its actual and predicted values.
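Most of these metrics follow directly from the four confusion matrix counts. The sketch below derives them for a small hand-made binary example using scikit-learn's metrics module. It shows the standard definitions, and gamba's function may compute or report them differently.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]

# For binary labels, ravel() yields the counts in this order.
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()

sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Diagnostic odds ratio; undefined when fp or fn is zero.
odds_ratio = (tp * tn) / (fp * fn)

# With hard 0/1 predictions the ROC curve has a single interior point.
auc = roc_auc_score(actual, predicted)
```

Here the counts are tn=3, fp=1, fn=1, tp=3, so sensitivity, specificity, accuracy, and AUC all equal 0.75 and the odds ratio is 9.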

Plotting

Some machine learning outputs favour plotting over raw tabular presentation. An example of this is a hierarchical clustering dendrogram, which is much more intuitive than a cluster membership report. The gamba library contains several methods for plotting the outputs of machine learning models.

import scipy.cluster.hierarchy as sch
import scipy.ndimage.filters as snf
import matplotlib.pyplot as plt


plot_cluster_sizes

plot_cluster_sizes(model)

Create a bar chart using a previously computed clustering model. Each bar represents a single cluster, with the height of the bars representing the number of members (players) in each cluster.
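A chart of this kind can be sketched from any fitted scikit-learn clustering model by counting its labels. The example below is an illustration on synthetic data, not gamba's plotting code; the axis labels and the headless Agg backend are choices made for this sketch.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three synthetic groups of deliberately different sizes.
features, _ = make_blobs(
    n_samples=[20, 30, 50],
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.5,
    random_state=0,
)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)

# Number of members in each cluster, indexed by cluster label.
sizes = np.bincount(model.labels_)

fig, ax = plt.subplots()
bars = ax.bar(range(len(sizes)), sizes)
ax.set_xlabel("cluster")
ax.set_ylabel("number of players")
plt.close(fig)  # replace with fig.savefig(...) or plt.show() as needed
```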

plot_agglomeration_dendrogram

plot_agglomeration_dendrogram(model, dt_cutoff=None, **kwargs)

Create a dendrogram visualising a hierarchical clustering method (agglomerative clustering). A horizontal line can be added using the dt_cutoff parameter to visualise the number of clusters at a given distance threshold.
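The relationship between the cutoff line and the number of clusters can be illustrated with scipy's hierarchy module. Unlike the gamba method, which takes a fitted model, this sketch builds the linkage matrix from raw synthetic data; the Ward linkage and the cutoff value of 10 are assumptions chosen for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import scipy.cluster.hierarchy as sch

# Three synthetic groups of ten players each.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
features = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in centres])

# Agglomerative (bottom-up) clustering with Ward linkage.
linkage_matrix = sch.linkage(features, method="ward")

cutoff = 10.0
fig, ax = plt.subplots()
tree = sch.dendrogram(linkage_matrix, ax=ax)
# Horizontal line marking the distance threshold.
ax.axhline(y=cutoff, linestyle="--", color="grey")
ax.set_ylabel("merge distance")
plt.close(fig)  # replace with fig.savefig(...) or plt.show() as needed

# Flat clusters obtained by cutting the tree at the same threshold.
flat = sch.fcluster(linkage_matrix, t=cutoff, criterion="distance")
n_clusters_at_cutoff = len(set(flat.tolist()))
```

The number of vertical branches the dashed line crosses equals the number of flat clusters at that threshold, which is why the cutoff line is a convenient visual aid.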