Run statistical tests on measures tables


## Overview

With a measures table created, a natural first step is to compute summary statistics across it. These can be compared with the measures tables of other providers or games, and can inform further analyses. The gamba library's statistics module contains methods for running these tests, plus some general descriptive methods.

## Methods

The most basic statistics function is descriptive_table, which returns a dataframe holding the mean, median, standard deviation, and interquartile range of each measure in the table.

#### descriptive_table[source]

descriptive_table(measures_table, loud=False, extended=False)

Creates the first table found in LaBrie et al.'s 2008 paper, which presents descriptive statistics for each of the behavioural measures they calculated.
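As a rough illustration of the statistics named above, the following sketch computes them with plain pandas. It is not gamba's implementation: the function name sketch_descriptive_table, the toy data, and the output column names are assumptions made for this example, and it assumes the measures table has a player_id column followed by numeric measure columns.

```python
import pandas as pd


def sketch_descriptive_table(measures_table: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: mean, std, median and IQR of each measure."""
    # Assumption: every column except player_id is a numeric measure.
    measures = measures_table.drop(columns=["player_id"])
    q1 = measures.quantile(0.25)
    q3 = measures.quantile(0.75)
    return pd.DataFrame(
        {
            "mean": measures.mean(),
            "std": measures.std(),  # sample standard deviation (ddof=1)
            "median": measures.median(),
            "iqr": q3 - q1,  # interquartile range
        }
    )


# Toy measures table with made-up values.
toy = pd.DataFrame(
    {
        "player_id": [1, 2, 3, 4, 5],
        "duration": [10, 20, 30, 40, 50],
        "frequency": [0.1, 0.2, 0.3, 0.4, 0.5],
    }
)
summary = sketch_descriptive_table(toy)
print(summary)
```

Each row of the result corresponds to one measure, which makes it straightforward to place tables from different providers or games side by side.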

With basic descriptive statistics computed, the next logical step is to run further statistical tests: one for the normality of each measure's distribution (ks_test), and one for the correlations between the (potentially non-normally distributed) measures (spearmans_r).

#### ks_test[source]

ks_test(measures_table)

Performs a one-sample Kolmogorov-Smirnov test. This approximately indicates whether or not each measure in a collection of calculated behavioural measures is normally distributed.
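A minimal sketch of a one-sample Kolmogorov-Smirnov normality check, using scipy.stats.kstest directly. It is an illustration under stated assumptions, not the code behind ks_test: the function name, the standardisation step, and the synthetic data are all invented here. Standardising with the sample's own mean and standard deviation is one reason such a check is only approximate.

```python
import numpy as np
import pandas as pd
from scipy import stats


def sketch_ks_normality(measures_table: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: KS test of each measure against a standard normal."""
    rows = {}
    for col in measures_table.columns.drop("player_id"):
        values = measures_table[col].to_numpy(dtype=float)
        # Assumption: standardise first so the sample is compared to N(0, 1).
        z = (values - values.mean()) / values.std(ddof=1)
        result = stats.kstest(z, "norm")
        rows[col] = {"statistic": result.statistic, "pvalue": result.pvalue}
    return pd.DataFrame.from_dict(rows, orient="index")


# Synthetic data: one roughly normal measure and one heavily skewed measure.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "player_id": np.arange(500),
        "normal_like": rng.normal(loc=10.0, scale=2.0, size=500),
        "skewed": rng.exponential(scale=5.0, size=500),
    }
)
ks = sketch_ks_normality(toy)
print(ks)
```

A small p-value suggests the measure departs from normality, which is the situation in which a rank-based correlation is usually preferred.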

#### spearmans_r[source]

spearmans_r(measures_table, loud=False)

Calculates the correlation coefficients (nonparametric Spearman's r) between a collection of behavioural measures. Because the resulting matrix is symmetric, its upper-right triangle is discarded.
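The idea of keeping only half of a symmetric correlation matrix can be sketched with pandas and NumPy. This is an assumption-laden illustration rather than the body of spearmans_r: the function name is hypothetical, the diagonal is kept here, and discarded cells are represented as NaN.

```python
import numpy as np
import pandas as pd


def sketch_spearman_lower(measures_table: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: Spearman correlations with the upper triangle blanked."""
    measures = measures_table.drop(columns=["player_id"])
    corr = measures.corr(method="spearman")
    # True on and below the diagonal, False above it.
    keep = np.tril(np.ones(corr.shape, dtype=bool))
    return corr.where(keep)


# Toy data: b rises monotonically with a, c falls monotonically with a.
toy = pd.DataFrame(
    {
        "player_id": [1, 2, 3, 4, 5],
        "a": [1, 2, 3, 4, 5],
        "b": [1, 4, 9, 16, 25],
        "c": [5, 4, 3, 2, 1],
    }
)
lower = sketch_spearman_lower(toy)
print(lower)
```

Because Spearman's coefficient works on ranks, the nonlinear but monotonic relationship between a and b still yields a coefficient of 1.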

## Labeled Measures Table Statistics

The statistics module also contains methods for computing statistics between groups as indicated by a label column in the measures table.

#### cohens_d[source]

cohens_d(measures_table, label)

Calculates Cohen's d between the behavioural measures of two groups of players. Groups are distinguished using a label column which is either 1 (in group) or 0 (not in group). For example, the column 'in_top5' may represent whether or not a player is in the top 5% of players by total amount wagered; it would be 1 for players in the top 5% and 0 for the remaining 95%.
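For reference, one common definition of Cohen's d divides the difference in group means by the pooled standard deviation. The sketch below applies that definition to a labelled measures table. Whether gamba uses exactly this pooled form is an assumption, and the function name and toy values are made up for illustration.

```python
import numpy as np
import pandas as pd


def sketch_cohens_d(measures_table: pd.DataFrame, label: str) -> pd.Series:
    """Hypothetical helper: Cohen's d (pooled SD) per measure, split by a 0/1 label."""
    in_group = measures_table[measures_table[label] == 1]
    out_group = measures_table[measures_table[label] == 0]
    results = {}
    for col in measures_table.columns.drop(["player_id", label]):
        a = in_group[col].to_numpy(dtype=float)
        b = out_group[col].to_numpy(dtype=float)
        # Pooled variance weights each group's sample variance by its degrees of freedom.
        pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (
            len(a) + len(b) - 2
        )
        results[col] = (a.mean() - b.mean()) / np.sqrt(pooled_var)
    return pd.Series(results, name="cohens_d")


# Toy table: three labelled players and three unlabelled players.
toy = pd.DataFrame(
    {
        "player_id": [1, 2, 3, 4, 5, 6],
        "total_wagered": [8.0, 10.0, 12.0, 2.0, 4.0, 6.0],
        "in_top5": [1, 1, 1, 0, 0, 0],
    }
)
d = sketch_cohens_d(toy, "in_top5")
print(d)
```

In this toy table the group means are 10 and 4 and both sample variances are 4, so the pooled standard deviation is 2 and d is 3.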

#### calculate_walker_matrix[source]

calculate_walker_matrix(measures_tables, labels, measure='frequency', loud=False)

Performs a two-sample Kolmogorov-Smirnov test between collections of a measure from different games.
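A pairwise two-sample comparison of this kind can be sketched with scipy.stats.ks_2samp. The helper below is hypothetical and only mirrors the documented signature. What calculate_walker_matrix actually stores in each cell is not stated here, so filling the matrix with KS statistics (and leaving the diagonal as NaN) is an assumption.

```python
import numpy as np
import pandas as pd
from scipy import stats


def sketch_pairwise_ks(measures_tables, labels, measure="frequency"):
    """Hypothetical helper: matrix of two-sample KS statistics between tables."""
    matrix = pd.DataFrame(np.nan, index=labels, columns=labels)
    for i, first in enumerate(measures_tables):
        for j, second in enumerate(measures_tables):
            if i == j:
                continue  # comparing a table with itself is uninformative
            result = stats.ks_2samp(first[measure], second[measure])
            matrix.iloc[i, j] = result.statistic
    return matrix


# Toy tables for three made-up games; A and B share a distribution, C does not.
game_a = pd.DataFrame({"player_id": [1, 2, 3, 4], "frequency": [1.0, 2.0, 3.0, 4.0]})
game_b = pd.DataFrame({"player_id": [5, 6, 7, 8], "frequency": [1.0, 2.0, 3.0, 4.0]})
game_c = pd.DataFrame({"player_id": [9, 10, 11, 12], "frequency": [10.0, 11.0, 12.0, 13.0]})

ks_matrix = sketch_pairwise_ks([game_a, game_b, game_c], ["A", "B", "C"])
print(ks_matrix)
```

A statistic near 0 indicates that the two samples have nearly identical empirical distributions, while a statistic near 1 indicates that they barely overlap.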

#### label_overlap_table[source]

label_overlap_table(measures_table, labels)

Calculates the number of players under each label in a collection (exclusively), and under each pair of labels (again exclusively) in the list provided. This method can be used to reproduce the final table in LaBrie et al.'s 2007 paper {% cite labrie2007assessing %}.
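One reading of the exclusive counting described above is that a player counts towards a single label only if they carry no other label, and towards a pair only if they carry exactly those two labels. The sketch below implements that reading. The interpretation, the function name, and the label column names are all assumptions rather than a description of gamba's code.

```python
from itertools import combinations

import pandas as pd


def sketch_label_overlap(measures_table: pd.DataFrame, labels: list) -> pd.Series:
    """Hypothetical helper: exclusive player counts per label and per label pair."""
    flags = measures_table[labels].astype(bool)
    n_labels = flags.sum(axis=1)  # how many labels each player carries
    counts = {}
    for label in labels:
        # Players with this label and no other.
        counts[label] = int((flags[label] & (n_labels == 1)).sum())
    for first, second in combinations(labels, 2):
        # Players with exactly these two labels.
        both = flags[first] & flags[second] & (n_labels == 2)
        counts[f"{first} & {second}"] = int(both.sum())
    return pd.Series(counts, name="players")


# Toy table with three made-up 0/1 label columns.
toy = pd.DataFrame(
    {
        "player_id": [1, 2, 3, 4, 5, 6],
        "top5_wagered": [1, 1, 0, 1, 0, 1],
        "top5_bets": [0, 1, 1, 1, 0, 1],
        "top5_loss": [0, 0, 0, 1, 0, 0],
    }
)
overlap = sketch_label_overlap(toy, ["top5_wagered", "top5_bets", "top5_loss"])
print(overlap)
```

Player 4 carries all three labels, so under this reading they are excluded from every single-label and pair count.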

## Utility Methods

The statistics module also has some utility methods which may not be directly useful for an analysis but handle simple tasks such as joining measures tables.

#### add_tables[source]

add_tables(t1, t2, same_columns=False)

Joins two tables (the second to the right-hand side of the first), adding '_2' to column names if the same_columns parameter is True.
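A side-by-side join of this kind can be sketched with pandas.concat. This is an illustrative stand-in, not the add_tables source: it assumes the '_2' suffix is applied to the second table's columns and that rows are aligned by position.

```python
import pandas as pd


def sketch_add_tables(t1: pd.DataFrame, t2: pd.DataFrame, same_columns=False) -> pd.DataFrame:
    """Hypothetical helper: place t2 to the right of t1, optionally suffixing t2's columns."""
    # Assumption: the suffix goes on the second table's column names.
    right = t2.add_suffix("_2") if same_columns else t2
    # Resetting the index aligns rows by position rather than by index label.
    return pd.concat([t1.reset_index(drop=True), right.reset_index(drop=True)], axis=1)


# Two toy tables with identical column names.
first = pd.DataFrame({"duration": [10, 20], "frequency": [0.1, 0.2]})
second = pd.DataFrame({"duration": [30, 40], "frequency": [0.3, 0.4]})

joined = sketch_add_tables(first, second, same_columns=True)
print(joined)
```

Without the suffix, the joined table would contain duplicate column names, which makes later column selection ambiguous.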

#### color_matrix[source]

color_matrix(matrix, cmap)

Creates a shaded matrix based on a color map.

    import pandas as pd
    import scipy.stats

    def standardise_measures_table(measures_table):
        "Standardises all measures columns in a measures table by applying the scipy.stats.zscore function to each column. This is useful for column-wise comparisons and some clustering methods."
        # every column after the first (player_id) is treated as a measure
        colnames = list(measures_table.columns)[1:]

        standardised_table = pd.DataFrame()
        standardised_table["player_id"] = measures_table["player_id"].values
        for col in colnames:
            # z-score: subtract the column mean, then divide by its standard deviation
            standardised_table[col] = scipy.stats.zscore(measures_table[col].values)

        return standardised_table
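To make the effect of scipy.stats.zscore concrete, here is a small worked example on made-up values. By default zscore uses the population standard deviation (ddof=0), so the standardised column has mean 0 and population standard deviation 1.

```python
import numpy as np
import scipy.stats

# Made-up durations for five players.
durations = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Mean is 30 and the population standard deviation is sqrt(200), about 14.142.
z = scipy.stats.zscore(durations)
print(z.round(3))

# After standardising, the column is centred on 0 with unit (population) spread.
print(z.mean(), z.std())
```

Since every standardised measure shares this scale, distance-based methods such as clustering are not dominated by whichever measure happens to have the largest raw values.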


## Plotting

#### plot_measure_hist[source]

plot_measure_hist(measures, name)

Plots a histogram for a named measure from a dataframe of measures.

Args:

- measures (Dataframe): Collection of behavioural measures for a cohort of players.
- name (String): The name of the measure to plot, e.g. 'duration'.

Returns: Matplotlib.pyplot plot object.

#### plot_measure_centile[source]

plot_measure_centile(measures, name, top_heavy=False)

Plots centiles of a single named measure from a dataframe of measures.

Args:

- measures (Dataframe): Collection of behavioural measures for a cohort of players.
- name (String): The name of the measure to plot, e.g. 'duration'.
- top_heavy (Boolean): Whether to plot each centile (100), or to plot every 5 up to 95 followed by 96-100 as individual percentiles (discontinuous x axis). Default is False (plot 100 bars).

Returns: Matplotlib.pyplot plot object.
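The two x-axis layouts described for top_heavy can be sketched with NumPy alone. The helper name is hypothetical, and using np.percentile for the bar heights is an assumption, since how plot_measure_centile derives each bar is not specified here.

```python
import numpy as np


def sketch_centile_values(values, top_heavy=False):
    """Hypothetical helper: centile positions and the measure's value at each one."""
    if top_heavy:
        # Every 5th centile up to 95, then 96 to 100 individually.
        centiles = list(range(5, 100, 5)) + list(range(96, 101))
    else:
        # One bar per centile, 1 to 100.
        centiles = list(range(1, 101))
    return centiles, np.percentile(values, centiles)


# Made-up measure values for 1000 players.
durations = np.arange(1, 1001)

all_centiles, all_heights = sketch_centile_values(durations)
top_centiles, top_heights = sketch_centile_values(durations, top_heavy=True)
print(len(all_centiles), len(top_centiles))
print(top_centiles[-6:])
```

The top-heavy layout spends proportionally more bars on the highest centiles, which is where heavily skewed behavioural measures tend to change fastest.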

#### plot_measure_pair_plot[source]

plot_measure_pair_plot(measures, label_override=None, thermal=False, figsize=(14, 14))

Plots a pair plot of the measures in a dataframe, with each measure plotted against every other.

Args:

- measures (Dataframe): Collection of behavioural measures for a cohort of players.
- label_override (List of Strings): List of axis labels. If None, the column names from the measures dataframe are used directly.
- thermal (Boolean): Show 2D histograms instead of scatter plots (better for perceiving density).
- figsize (Tuple of Integers (2)): Size of the resulting plot. (14,14) is good for large numbers of measures (5+).

Returns: Matplotlib.pyplot plot object.