The gamba library contains several wrapper methods for commonly applied machine learning methods. Whilst it would be impossible to cover every machine learning method used in existing literature, the gamba library aims to provide a number of examples so that if you have a particular method in mind, some variant or similar implementation is available for you to extend.
Applies the k-means clustering algorithm to a measures table. The resulting clustering is reported in terms of its inertia (the sum of squared distances of samples to their closest cluster center) and its silhouette score (how distinct the clusters are within the sample; see the scikit-learn docs for details). Using the data_only parameter, the measures passed as the first parameter can be returned with an added column reporting the cluster each player belongs to.
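The underlying calculation can be sketched with scikit-learn directly — a minimal example, where the measure column names and values are purely illustrative:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# an illustrative measures table (two behavioural measures per player)
measures = pd.DataFrame({
    "duration": [1.0, 2.0, 1.5, 8.0, 9.0, 8.5],
    "frequency": [0.2, 0.3, 0.25, 0.9, 0.95, 0.85],
})

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(measures)
print("inertia:", model.inertia_)
print("silhouette:", silhouette_score(measures, model.labels_))

# equivalent of the data_only behaviour described above:
# attach each player's cluster membership as a column
measures["cluster"] = model.labels_
```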
Computes the k_means calculation above across a range of cluster counts, returning their goodness of fit measures (inertia and silhouette).
Repeats the k_means clustering above across a range of cluster counts, running each count a number of times. This is useful for determining clusters in a more robust way, but can be slow on large data sets.
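Sweeping over cluster counts amounts to a simple loop collecting the two goodness-of-fit measures at each count — a sketch using scikit-learn, again with illustrative data:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

measures = pd.DataFrame({
    "duration": [1.0, 2.0, 1.5, 8.0, 9.0, 8.5, 4.0, 4.2],
    "frequency": [0.2, 0.3, 0.25, 0.9, 0.95, 0.85, 0.5, 0.55],
})

fits = []
for k in range(2, 5):  # candidate cluster counts
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(measures)
    fits.append({
        "clusters": k,
        "inertia": model.inertia_,
        "silhouette": silhouette_score(measures, model.labels_),
    })

results = pd.DataFrame(fits)  # one row of fit measures per cluster count
```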
Performs sklearn's agglomerative clustering algorithm on a dataframe of behavioural measures; see their documentation for details. Note: either the distance_threshold or the n_clusters parameter must be None.
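The constraint between the two parameters can be seen in a minimal scikit-learn sketch — clustering by distance threshold requires n_clusters to be None (the data here is synthetic):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# two tight groups of points, far apart from each other
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])

# n_clusters must be None when distance_threshold is set (and vice versa)
model = AgglomerativeClustering(n_clusters=None, distance_threshold=1.0).fit(X)
print(model.n_clusters_)  # number of clusters found at this threshold
```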
A basic example of a classification network using a single sigmoid-activated output node. The output is the probability of the player belonging to the value of the label. As with the neural_network_regression, this network closely follows examples in Deep Learning with Python, which is a great resource for extending this example.
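Keras (as in Deep Learning with Python) is the natural tool for building this network; as a lighter-weight sketch of the same idea, scikit-learn's MLPClassifier also uses a single logistic (sigmoid) output for binary problems. The data here is synthetic and the topology illustrative:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))           # three behavioural measures per player
labels = (X[:, 0] > 0).astype(int)      # a synthetic binary label

# for binary classification MLPClassifier ends in a single sigmoid output
model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
model.fit(X, labels)

# probability of each player belonging to the positive label
probabilities = model.predict_proba(X)[:, 1]
```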
A very simple neural network for regression using two hidden layers and one output node with a linear activation function. This method is very similar to the example in Deep Learning with Python, and can be used as a template for more complex network topologies.
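The regression counterpart swaps the sigmoid output for a linear one. A sketch using scikit-learn's MLPRegressor (whose output layer is linear by design) rather than Keras, on synthetic data:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))           # three behavioural measures per player
y = X @ np.array([1.0, -2.0, 0.5])      # a synthetic continuous target

# two hidden layers, one linear output node
model = MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0)
model.fit(X, y)
predictions = model.predict(X)
```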
Logistic regression techniques can map simple relationships between variables, but typically become untenable in more complex applications. As with the clustering techniques, the gamba library contains wrappers around the sklearn library's logistic regression functions, offering simple methods to run analyses without having to interact with the sklearn library directly (although you should for advanced techniques!).
Performs a logistic regression using the statsmodels library, returning the predicted labels rounded to the nearest integer. Note: this method is currently hard-coded to only function on Philander's 2014 data set. An abstracted logistic regression function is coming soon.
Performs a 'lasso' logistic regression (optimising a least-squares problem with an L1 penalty) using sklearn's linear_model. This stackoverflow post contains a useful discussion on this type of function-estimation regression pair.
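Using scikit-learn directly, the L1 penalty requires a compatible solver — a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # only two informative features

# the L1 ('lasso') penalty needs a solver such as liblinear or saga
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
model.fit(X, y)
print(model.coef_)  # coefficients of uninformative features are driven to zero
```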
Support vector machines (SVMs) are another type of machine learning algorithm which have seen limited use in existing work, but are included as templates to see how they could be used as part of an analysis using the gamba library. The three SVM methods implemented include those used in Philander's 2014 study.
Creates and trains a support vector machine for epsilon-support vector regression using the sklearn library's implementation.
Creates and trains a support vector machine for classification using the sklearn library's implementation.
Creates and trains a support vector machine for one-class classification using the sklearn library's implementation.
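All three variants map onto a scikit-learn class — a minimal sketch of each on synthetic data:

```python
import numpy as np
from sklearn.svm import SVR, SVC, OneClassSVM

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2))  # two synthetic behavioural measures

# epsilon-support vector regression on a continuous target
svr = SVR(kernel="rbf").fit(X, X[:, 0] + X[:, 1])

# support vector classification on a binary label
svc = SVC(kernel="rbf").fit(X, (X[:, 0] > 0).astype(int))

# one-class classification is trained without labels (novelty detection)
ocs = OneClassSVM(nu=0.1).fit(X)
flags = ocs.predict(X)  # +1 for inliers, -1 for outliers
```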
Random forests are collections of decision trees which are applied en masse to classification and regression problems. The gamba library implements both techniques in a very basic form, following a similar structure to the other machine learning methods in the library. As with many other methods on this page, the gamba library simply contains wrappers for the scikit-learn library, which should be used directly for more advanced analyses!
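Both forms follow the same fit/predict pattern in scikit-learn — a short sketch on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 4))
labels = (X[:, 0] > 0).astype(int)   # synthetic classification target
target = X[:, 0] * 2 + X[:, 1]       # synthetic regression target

# each forest aggregates the predictions of its individual decision trees
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, target)
```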
Performance measures of many machine learning methods can be derived from the values they predict and the actual values of a test dataset. The compute_performance function below is useful for quick estimates of performance, but you should refer to the measures appropriate to the specific method you're implementing for advanced analyses.
Computes performance metrics including sensitivity, specificity, accuracy, confusion matrix values, odds ratio, and area under curve, for a given classification/regression using its actual and predicted values.
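Each of these metrics falls out of the confusion matrix — a sketch with a small hand-made pair of actual and predicted label arrays:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

actual    = np.array([1, 0, 1, 1, 0, 0, 1, 0])
predicted = np.array([1, 0, 1, 0, 0, 1, 1, 0])

# confusion matrix values: true/false negatives and positives
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()

sensitivity = tp / (tp + fn)              # true positive rate
specificity = tn / (tn + fp)              # true negative rate
accuracy = (tp + tn) / (tp + tn + fp + fn)
odds_ratio = (tp * tn) / (fp * fn)        # undefined if fp or fn is zero
auc = roc_auc_score(actual, predicted)    # area under the ROC curve
```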
Some machine learning outputs favour plotting over raw tabular presentation. An example of this is a hierarchical clustering dendrogram, which is much more intuitive than a cluster membership report. The gamba library contains several methods for plotting the outputs of machine learning models:
import scipy.cluster.hierarchy as sch
import scipy.ndimage.filters as snf
import matplotlib.pyplot as plt
Create a bar chart using a previously computed clustering model. Each bar represents a single cluster, with the height of the bars representing the number of members (players) in each cluster.
Create a dendrogram visualising a hierarchical clustering method (agglomerative clustering). A horizontal line can be added using the dt_cutoff parameter to visualise the number of clusters at a given distance threshold.
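The core of such a plot is scipy's dendrogram over a linkage matrix — a sketch on synthetic points, using no_plot=True to show the structure without a display (drop it and use matplotlib, e.g. plt.axhline(y=dt_cutoff), to render the figure with a horizontal threshold line):

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# synthetic points: two tight pairs plus one distant outlier
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9], [10.0, 10.0]])

# linkage matrix encoding the sequence of agglomerative merges
linkage = sch.linkage(X, method="ward")

# no_plot=True returns the dendrogram layout without drawing anything
dendrogram = sch.dendrogram(linkage, no_plot=True)
```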