Full replication

This notebook reproduces every table in LaBrie et al.'s 2008 paper on casino gambling behaviour. To get started, download the raw data from the Transparency Project's website using the link below. The data we need is Raw Dataset 2 (text version), listed under the heading 'Virtual Casino Gambling: February 2005 through February 2007' towards the bottom of the page.

Once you've downloaded and extracted it, you should see a file called RawDataSet2_DailyAggregCasinoTXT.txt. Copy this into the same directory as this notebook to begin.

The first step is to import the gamba framework, which the cell below does. If this cell throws an error, see the install documentation page to make sure you have gamba installed.

import gamba as gb

With gamba loaded, the next step is to get the data into a usable format. To do this, we call the prepare_labrie_data method from the data module. This does two things: first it renames the columns to values compatible with the gamba framework, then it saves this newly compatible dataframe as a new CSV file (in case it's needed elsewhere).

all_player_bets = gb.data.prepare_labrie_data('RawDataSet2_DailyAggregCasinoTXT.txt')

In two lines of code we're ready to start the analysis, with each player's transactions individually saved in case anything goes wrong or we want to take a sample. The next step is to load in the data we just prepared; this uses some magic from the glob library to load every CSV file in the labrie_individuals/ folder into the variable all_player_bets.
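If you're curious what that glob magic looks like, the pattern is roughly the following (a sketch using plain pandas and the folder name mentioned above, not gamba's actual implementation):

```python
import glob
import pandas as pd

def load_player_csvs(folder):
    """Load every per-player CSV in `folder` into a list of dataframes."""
    # sorted() keeps the player order stable between runs
    return [pd.read_csv(f) for f in sorted(glob.glob(f'{folder}/*.csv'))]

# e.g. all_player_bets = load_player_csvs('labrie_individuals')
```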

If we want to do any other analysis on all of the players, this is where we would add new methods. But let's crack on with calculating each of the measures described in the paper, which include things like frequency, duration, total amount wagered, and so on. Heads up: this calculation can take a while on a normal computer (the run below took over 15 minutes), so now is a great time to share this page with a colleague, or tweet us your feedback!
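To give a feel for what's happening under the hood, here's how a few of those measures could be computed for a single player's daily-aggregate bets. This is an illustrative sketch only: the column names (bet_time, bet_size, payout_size) and the exact duration convention are assumptions, and gamba's actual implementation may differ.

```python
import pandas as pd

def sketch_measures(player_bets):
    """Compute a few LaBrie-style measures from one player's daily aggregates."""
    days = pd.to_datetime(player_bets['bet_time'])
    # days from first to last bet, inclusive of both endpoints
    duration = (days.max() - days.min()).days + 1
    # percentage of days within that window on which the player bet
    # (each row is one betting day in a daily-aggregate file)
    frequency = 100 * len(player_bets) / duration
    total_wagered = player_bets['bet_size'].sum()
    net_loss = total_wagered - player_bets['payout_size'].sum()
    percent_loss = 100 * net_loss / total_wagered
    return {'duration': duration, 'frequency': frequency,
            'total_wagered': total_wagered, 'net_loss': net_loss,
            'percent_loss': percent_loss}
```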

measures_table = gb.measures.calculate_labrie_measures(all_player_bets, loud=True)
100%|██████████| 4222/4222 [15:26<00:00,  4.56it/s]
LaBrie measures saved

The cell above took a while to finish, so to make sure we don't have to do that computation again, the output has been saved as gamba_labrie_measures.csv next to this notebook. We'll come back to this file later to make sure this recreation matches the original, but let's keep going! Time for the first meaningful output: the first table in the original paper, which describes the measures we just calculated using basic statistics;
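The descriptive table is just per-column summary statistics, so something equivalent can be computed directly with pandas if you want to check the numbers yourself (a sketch, not gamba's implementation):

```python
import pandas as pd

def descriptive_sketch(measures):
    """Mean, standard deviation, and median of each measure column."""
    # agg() gives one row per statistic; transpose so rows are measures
    return measures.agg(['mean', 'std', 'median']).T
```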

measures_table = gb.read_csv('gamba_labrie_measures.csv')
labrie_table = gb.statistics.descriptive_table(measures_table)
mean std median
duration 299.206774 236.670214 261.000000
frequency 16.326440 21.017499 7.429744
num_bets 3515.047845 12210.792339 532.000000
mean_bets_per_day 115.950530 191.986075 48.911275
mean_bet_size 34.760626 184.090509 4.158098
total_wagered 27171.583452 109603.915777 2603.340000
net_loss 839.710129 3229.177371 117.250000
percent_loss 7.726259 11.579289 5.494480

Nice! Looks like the original! Next up is the Spearman's r coefficient matrix, which tells us how the measures relate to one another. Run the next cell;
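If you want to sanity-check the matrix, pandas can compute the same rank correlations directly (a sketch; gamba's version additionally marks significance with asterisks, as in the output below):

```python
import pandas as pd

def spearman_sketch(measures):
    """Spearman rank-correlation matrix between all measure columns."""
    return measures.corr(method='spearman')
```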

spearman_coefficient_table = gb.statistics.spearmans_r(measures_table)
duration frequency num_bets mean_bets_per_day mean_bet_size total_wagered net_loss percent_loss
duration -
frequency -0.63** -
num_bets 0.26** 0.22** -
mean_bets_per_day 0.01 0.13** 0.87** -
mean_bet_size 0.05** 0.09** -0.24** -0.41** -
total_wagered 0.27** 0.27** 0.66** 0.41** 0.52** -
net_loss 0.23** 0.16** 0.49** 0.33** 0.32** 0.7** -
percent_loss -0.07** -0.18** -0.26** -0.14** -0.27** -0.43** 0.2** -

Nice x2! Now that the first two tables from the paper have been reproduced, the measures need splitting into the top 5% and remaining 95% of players by their total amount wagered. The top_split method from the gamba.labelling module does this, returning the measures table with an added label column marking whether each player is in the top 5% cohort.
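Conceptually, the split is just a 95th-percentile threshold on total wagered. A sketch of that labelling step in plain pandas (gamba's top_split may differ in detail, e.g. in how it handles ties at the cutoff):

```python
import pandas as pd

def label_top_split(measures, column='total_wagered', percent=5):
    """Add a 0/1 label marking players in the top `percent`% by `column`."""
    cutoff = measures[column].quantile(1 - percent / 100)
    labelled = measures.copy()
    labelled['top_' + column] = (labelled[column] >= cutoff).astype(int)
    return labelled
```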

labelled_measures = gb.labelling.top_split(measures_table, 'total_wagered', loud=True)
top count: 212
other count: 4010

With the two cohorts separated, the last part of the paper uses the same descriptive statistics to present their differences. To reproduce that using gamba, we simply call the same method as for the first table on each of the cohorts;
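The grouping step amounts to filtering on the label column. Here's a sketch of the kind of split get_labelled_groups performs (the real method may differ; the index convention here, with the top cohort at position 1, matches how it's used below):

```python
import pandas as pd

def get_groups_sketch(labelled, label='top_total_wagered'):
    """Split a labelled table into [label == 0, label == 1] dataframes."""
    return [labelled[labelled[label] == v] for v in (0, 1)]
```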

labelled_groups = gb.labelling.get_labelled_groups(labelled_measures, 'top_total_wagered')
top5_table = gb.statistics.descriptive_table(labelled_groups[1])
other95_table = gb.statistics.descriptive_table(labelled_groups[0])
display(top5_table, other95_table)
mean std median
duration 476.023585 232.523748 529.000000
frequency 24.299398 16.768756 20.148853
num_bets 24558.452830 36778.787680 10464.500000
mean_bets_per_day 284.988926 343.720575 188.175380
mean_bet_size 213.216382 682.412984 24.965530
total_wagered 345579.044151 354890.079012 233915.905000
net_loss 8746.290143 11212.842660 6698.250000
percent_loss 2.574661 2.593843 2.521351
top_total_wagered 1.000000 0.000000 1.000000
mean std median
duration 289.858853 233.213355 246.000000
frequency 15.904928 21.136601 6.861726
num_bets 2402.528678 7819.175770 485.500000
mean_bets_per_day 107.013836 176.064687 45.897791
mean_bet_size 25.326057 96.945569 3.826164
total_wagered 10338.071814 19359.770871 2284.452150
net_loss 421.706398 938.710858 107.000000
percent_loss 7.998612 11.804173 5.849977
top_total_wagered 0.000000 0.000000 0.000000

That's it! In around 10 lines of code the gamba framework fully replicates the findings of LaBrie et al.'s 2008 paper. The most interesting question now is how to extend this analysis to uncover more detail in the data, or to calculate new behavioural measures and see whether they are useful.