Full study replication

This notebook reproduces every table in LaBrie et al.'s 2007 paper on internet sports gambling activity. To get started, download the raw data from the Transparency Project's website. The data we need is Raw Dataset 2 (text version), listed under the title 'Actual Internet Sports Gambling Activity: February 2005 through September 2005' towards the bottom of the page.

Once you've downloaded and extracted it, you should see a file called RawDataIIUserDailyAggregation.txt. Copy this file into the same directory as this notebook to begin.

Note: there is a minor discrepancy in the number of players taken forward after data cleaning, which has minor effects on the fixed-odds figures below.

import gamba as gb

The first step is to split the raw data so that measures can later be calculated on a per-player basis. For this particular study, the data must first be split by product ID (1 for fixed-odds betting, 2 for live-action betting), and each product's data is then prepared separately:

# Load the raw daily aggregates and write one CSV per product
daily_data = gb.data.read_csv('RawDataIIUserDailyAggregation.txt')
daily_data[daily_data['ProductID'] == 1].to_csv('fixed_odds_daily.csv', index=False)
daily_data[daily_data['ProductID'] == 2].to_csv('live_action_daily.csv', index=False)
# Prepare each product's data for the measure calculations below
fo_data = gb.data.prepare_labrie_data('fixed_odds_daily.csv', year=2007)
la_data = gb.data.prepare_labrie_data('live_action_daily.csv', year=2007)

Next, calculate the behavioural measures used in the paper for each player: duration, frequency, number of bets, average bets per day, average bet size, total amount wagered, net loss, and percentage loss.

gb.measures.calculate_labrie_measures(fo_data, filename='fo_labrie_measures.csv', loud=True)
100%|██████████| 42157/42157 [1:04:04<00:00, 10.96it/s]
LaBrie measures saved
       player_id  duration   frequency  num_bets  mean_bets_per_day  mean_bet_size  total_wagered  net_loss  percent_loss
    0    1324354       219   53.424658       236           2.017094      42.954788     10137.3300  -86.7900     -0.856143
    1    1324355       241   41.078838       231           2.333333       1.735325       400.8600  -52.4400    -13.081874
    2    1324356       223   22.869955        98           1.921569       7.001939       686.1900  400.6800     58.391991
    3    1324358        95    8.421053         7           0.875000      35.385300       247.6971   93.8215     37.877512
    4    1324360       236   12.288136        40           1.379310       1.499982        59.9993   20.0429     33.405223
  ...        ...       ...         ...       ...                ...            ...            ...       ...           ...
42152    1405181       206   16.019417       186           5.636364       2.404677       447.2700  313.2200     70.029289
42153    1405183        65   56.923077        45           1.216216       1.972444        88.7600   67.3000     75.822443
42154    1405184        34   94.117647       110           3.437500       2.431636       267.4800   58.1000     21.721250
42155    1405189        87   43.678161       115           3.026316       1.956870       225.0400  -11.8100     -5.247956
42156    1405190         6   50.000000        10           3.333333       5.000000        50.0000   50.0000    100.000000

42157 rows × 9 columns

gb.measures.calculate_labrie_measures(la_data, filename='la_labrie_measures.csv', loud=True)
100%|██████████| 26198/26198 [25:18<00:00, 17.25it/s]
LaBrie measures saved
       player_id  duration   frequency  num_bets  mean_bets_per_day  mean_bet_size  total_wagered  net_loss  percent_loss
    0    1324354       146   13.013699        43           2.263158      42.773953      1839.2800  326.7900     17.767279
    1    1324355         7  100.000000        21           3.000000       1.176190        24.7000   13.5000     54.655870
    2    1324356       222   10.810811       116           4.833333       5.854052       679.0700   53.9200      7.940271
    3    1324358         1  100.000000         4           4.000000      22.148175        88.5927   32.6108     36.809805
    4    1324360       222    0.900901         3           1.500000       0.581133         1.7434    0.5425     31.117357
  ...        ...       ...         ...       ...                ...            ...            ...       ...           ...
26193    1405179         4  100.000000        13           3.250000       1.429231        18.5800  -20.9900   -112.970936
26194    1405181       155    1.935484         7           2.333333       5.500000        38.5000   21.8700     56.805195
26195    1405183        61   22.950820        39           2.785714       2.618718       102.1300   17.7000     17.330853
26196    1405184        33   63.636364        98           4.666667       2.796224       274.0300  -23.1000     -8.429734
26197    1405189        51    3.921569         3           1.500000       3.000000         9.0000   -8.8000    -97.777778

26198 rows × 9 columns
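
The columns in the two tables above can be reproduced by hand for a single player. The sketch below is not gamba's implementation: its definitions are inferred from the output rows shown (for example, 236 bets over 117 active days gives the 2.017094 mean bets per day in the first fixed-odds row), and the input column names bet_date, bet_count, bet_size and payout are hypothetical.

```python
import pandas as pd


def labrie_measures_sketch(player_bets: pd.DataFrame) -> dict:
    """Compute the eight measures for one player's daily aggregates.

    Assumed (hypothetical) columns: 'bet_date' (datetime), 'bet_count',
    'bet_size' (total staked that day) and 'payout' (total won that day).
    """
    active_days = player_bets["bet_date"].dt.normalize().nunique()
    first, last = player_bets["bet_date"].min(), player_bets["bet_date"].max()
    duration = (last - first).days + 1  # first and last day both counted
    num_bets = int(player_bets["bet_count"].sum())
    total_wagered = float(player_bets["bet_size"].sum())
    net_loss = total_wagered - float(player_bets["payout"].sum())
    return {
        "duration": duration,
        "frequency": 100 * active_days / duration,  # % of days with a bet
        "num_bets": num_bets,
        "mean_bets_per_day": num_bets / active_days,
        "mean_bet_size": total_wagered / num_bets,
        "total_wagered": total_wagered,
        "net_loss": net_loss,
        "percent_loss": 100 * net_loss / total_wagered,
    }


# Toy player: three active days spread over a ten-day window.
example = pd.DataFrame({
    "bet_date": pd.to_datetime(["2005-02-01", "2005-02-03", "2005-02-10"]),
    "bet_count": [2, 1, 3],
    "bet_size": [10.0, 5.0, 15.0],
    "payout": [0.0, 12.0, 6.0],
})
measures = labrie_measures_sketch(example)
```

Under these assumed definitions, a negative net_loss (as in several rows above) simply means the player won more than they staked.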

Next, for this replication, take the user IDs from the original analytic data set (AnalyticDataInternetGambling.txt, which also needs to be in the same directory as this notebook) and keep only those players from the measures calculated above:

fo_gamba_measures = gb.data.read_csv('fo_labrie_measures.csv')
la_gamba_measures = gb.data.read_csv('la_labrie_measures.csv')
original = gb.data.read_csv('AnalyticDataInternetGambling.txt')

fo_bettors = original[original['FOTotalBets'] > 0]
la_bettors = original[original['LATotalBets'] > 0]

gamba_fo = fo_gamba_measures[fo_gamba_measures['player_id'].isin(fo_bettors['USERID'].values)]
gamba_la = la_gamba_measures[la_gamba_measures['player_id'].isin(la_bettors['USERID'].values)]
t1a = gb.statistics.descriptive_table(gamba_fo)
t1b = gb.statistics.descriptive_table(gamba_la)
display(t1a.round())
display(t1b.round())
    
fo_spearmans = gb.statistics.spearmans_r(gamba_fo)
la_spearmans = gb.statistics.spearmans_r(gamba_la)
display(fo_spearmans)
display(la_spearmans)
                     mean     std  median
duration            118.0    89.0   117.0
frequency            40.0    30.0    32.0
num_bets            135.0   496.0    36.0
mean_bets_per_day     3.0     6.0     2.0
mean_bet_size        12.0    32.0     4.0
total_wagered       730.0  3439.0   148.0
net_loss             97.0   579.0    33.0
percent_loss         32.0    62.0    29.0

                     mean     std  median
duration             79.0    83.0    40.0
frequency            43.0    37.0    27.0
num_bets             99.0   407.0    15.0
mean_bets_per_day     4.0     5.0     3.0
mean_bet_size        11.0    25.0     4.0
total_wagered      1319.0  8593.0    61.0
net_loss             85.0   571.0     9.0
percent_loss         23.0    61.0    18.0

                   duration  frequency  num_bets  mean_bets_per_day  mean_bet_size  total_wagered  net_loss  percent_loss
duration                  -
frequency           -0.57**          -
num_bets             0.64**      0.01*         -
mean_bets_per_day     0.2**     0.14**    0.71**                  -
mean_bet_size       -0.16**         -0   -0.37**            -0.38**              -
total_wagered        0.54**      -0.01    0.75**             0.46**         0.27**              -
net_loss              0.2**       0.01    0.33**             0.29**         0.16**         0.46**         -
percent_loss        -0.35**     0.06**   -0.43**            -0.19**             -0        -0.46**    0.42**             -

                   duration  frequency  num_bets  mean_bets_per_day  mean_bet_size  total_wagered  net_loss  percent_loss
duration                  -
frequency           -0.78**          -
num_bets              0.7**    -0.29**         -
mean_bets_per_day    0.33**    -0.05**     0.8**                  -
mean_bet_size        0.03**     0.04**    0.03**             -0.02*              -
total_wagered        0.58**    -0.21**    0.83**             0.65**         0.54**              -
net_loss             0.27**    -0.07**    0.41**             0.37**         0.27**          0.5**         -
percent_loss        -0.25**     0.11**   -0.32**            -0.21**        -0.1**        -0.31**    0.47**             -
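
For readers unfamiliar with the correlation tables above, Spearman's coefficient is the Pearson correlation of the ranks of two measures. The following is a minimal sketch on made-up values using pandas alone, not a reproduction of gb.statistics.spearmans_r. It omits the p-values that the asterisks in the tables presumably summarise; those would need something like scipy.stats.spearmanr.

```python
import pandas as pd

# Toy measures for five hypothetical players (values are invented).
toy = pd.DataFrame({
    "duration": [10, 50, 120, 200, 230],
    "num_bets": [5, 40, 90, 400, 300],
    "percent_loss": [60.0, 35.0, 30.0, 12.0, 8.0],
})

# Pairwise Spearman rank correlations between all columns.
rho = toy.corr(method="spearman")

# Keep only the lower triangle's values of interest, as in the tables above.
duration_vs_bets = rho.loc["num_bets", "duration"]
duration_vs_loss = rho.loc["percent_loss", "duration"]
```

Here duration and num_bets differ in rank order only for the last two players, so their coefficient is high but below 1, while percent_loss falls strictly as duration rises, giving a coefficient of -1.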

With both the descriptive and inter-measure correlation tables complete, each player can be labelled according to whether they fall in the top 1% of their cohort on a given measure. The measures used here are net loss, total amount wagered, and number of bets. This is done for both the fixed-odds (fo) and live-action (la) data:

fo_labelled = gb.labelling.top_split(gamba_fo, 'net_loss', percentile=99)
fo_labelled = gb.labelling.top_split(fo_labelled, 'total_wagered', percentile=99)
fo_labelled = gb.labelling.top_split(fo_labelled, 'num_bets', percentile=99)

t3a = gb.statistics.descriptive_table(fo_labelled[fo_labelled['top_net_loss'] == 1])
t3b = gb.statistics.descriptive_table(fo_labelled[fo_labelled['top_total_wagered'] == 1])
t3c = gb.statistics.descriptive_table(fo_labelled[fo_labelled['top_num_bets'] == 1])

la_labelled = gb.labelling.top_split(gamba_la, 'net_loss', percentile=99)
la_labelled = gb.labelling.top_split(la_labelled, 'total_wagered', percentile=99)
la_labelled = gb.labelling.top_split(la_labelled, 'num_bets', percentile=99)

t3d = gb.statistics.descriptive_table(la_labelled[la_labelled['top_net_loss'] == 1])
t3e = gb.statistics.descriptive_table(la_labelled[la_labelled['top_total_wagered'] == 1])
t3f = gb.statistics.descriptive_table(la_labelled[la_labelled['top_num_bets'] == 1])

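# Place the three top-1% descriptive tables side by side for each product
t3_top = gb.data.concat([t3a, t3b, t3c], axis=1).reindex(t3a.index)
t3_bottom = gb.data.concat([t3d, t3e, t3f], axis=1).reindex(t3d.index)
# Drop the last three rows, which appear to correspond to the added top_* label columns
t3_top.drop(t3_top.tail(3).index, inplace=True)
t3_bottom.drop(t3_bottom.tail(3).index, inplace=True)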

display(t3_top)
display(t3_bottom)
                   top_net_loss                              top_total_wagered                         top_num_bets
                           mean           std        median          mean           std        median          mean           std        median
duration             189.653266     56.839253    215.000000    194.108040     52.510965    217.000000    204.321608     43.109924    220.000000
frequency             51.980067     23.270583     50.721289     57.717627     21.051259     55.939801     65.552105     20.646955     66.860598
num_bets            1541.447236   3237.726930    420.500000   1434.698492   3147.428568    421.500000   3493.422111   3149.957179   2369.500000
mean_bets_per_day     15.208532     43.117456      4.674917     11.165698     21.832324      4.093319     31.309630     43.178783     21.691200
mean_bet_size         55.451052     94.212106     22.879698     76.909173     95.948150     44.709555      2.876123      5.009361      1.355332
total_wagered      15003.748844  15703.525546  10195.054200  22866.646728  23854.204742  16735.253500   8421.379570  12884.908372   4153.159900
net_loss            3486.437569   2615.810563   2642.910000   1833.757579   4542.107773   1533.167300   1260.820220   2228.848492    742.825000
percent_loss          35.255165     22.175326     29.275363      9.610159     15.765185      8.818498     19.356744     17.400062     17.577639

                   top_net_loss                              top_total_wagered                         top_num_bets
                           mean           std        median          mean           std        median          mean           std        median
duration             188.875000     52.833599    213.000000    187.963710     50.231153    209.000000    206.362903     33.896087    217.000000
frequency             50.189225     23.585788     50.110132     57.298534     21.238467     55.690835     65.129066     17.533174     65.594503
num_bets            1760.600806   2675.437088    971.500000   1700.479839   2311.852698   1034.500000   2932.298387   2449.241309   2147.500000
mean_bets_per_day     15.710240     15.841488     11.201104     14.371118     13.344901     10.637707     22.559753     15.086767     17.986928
mean_bet_size         58.915359     63.382781     34.233412     80.316555     78.755138     52.998731     15.138555     25.973484      6.244990
total_wagered      47792.303745  56630.099339  29102.187250  64588.914254  52992.361609  44042.135000  35987.606031  54142.890433  15707.764100
net_loss            4179.576040   3059.423521   3051.990000   2637.989352   4262.404915   1971.470000   2151.269950   3110.824224   1110.375000
percent_loss          14.954674     11.762545     11.778838      4.350920      6.797097      4.312687      8.832085      7.074857      7.471395
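
The top-1% labelling used for the tables above can be illustrated with a percentile cut-off. This is a rough sketch under stated assumptions, not the source of gb.labelling.top_split: the helper name top_split_sketch is hypothetical, and whether gamba treats the cut-off as strict (>) or inclusive (>=) is not confirmed here.

```python
import pandas as pd


def top_split_sketch(measures: pd.DataFrame, column: str, percentile: float = 99) -> pd.DataFrame:
    """Add a 0/1 'top_<column>' label for players above the given percentile."""
    labelled = measures.copy()
    cutoff = labelled[column].quantile(percentile / 100)
    # Assumption: strictly greater than the cut-off counts as "top".
    labelled["top_" + column] = (labelled[column] > cutoff).astype(int)
    return labelled


# Toy cohort of 200 players whose net loss equals their player id.
toy = pd.DataFrame({
    "player_id": range(1, 201),
    "net_loss": [float(i) for i in range(1, 201)],
})
labelled = top_split_sketch(toy, "net_loss", percentile=99)
top_players = labelled.loc[labelled["top_net_loss"] == 1, "player_id"].tolist()
```

With 200 players, the 99th percentile lies just above 198, so two players (1% of the cohort) receive the label. Chaining the call over several columns, as the notebook does, adds one label column per measure.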

Finally, explore the overlap between players labelled under the different schemes, as in the original paper:

fo_table = gb.label_overlap_table(fo_labelled, ['top_net_loss','top_total_wagered','top_num_bets'])
la_table = gb.label_overlap_table(la_labelled, ['top_net_loss','top_total_wagered','top_num_bets'])
display(fo_table)
display(la_table)
                   top_net_loss_only  top_total_wagered_only  top_num_bets_only  top_net_loss and top_total_wagered only  top_net_loss and top_num_bets only  top_total_wagered and top_num_bets only  all labels
top_net_loss                171 (43)                       -                  -                                 143 (36)                              32 (8)                                        -     52 (13)
top_total_wagered                  -                178 (45)                  -                                        -                                   -                                   25 (6)     52 (13)
top_num_bets                       -                       -           289 (73)                                        -                                   -                                        -     52 (13)

                   top_net_loss_only  top_total_wagered_only  top_num_bets_only  top_net_loss and top_total_wagered only  top_net_loss and top_num_bets only  top_total_wagered and top_num_bets only  all labels
top_net_loss                 92 (37)                       -                  -                                  65 (26)                             24 (10)                                        -     67 (27)
top_total_wagered                  -                 89 (36)                  -                                        -                                   -                                  27 (11)     67 (27)
top_num_bets                       -                       -           130 (52)                                        -                                   -                                        -     67 (27)
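
The counts in the overlap tables above can be read as the number of players holding each exact combination of labels, with the bracketed figure giving that count as a percentage of everyone holding the row's label. The sketch below illustrates the idea on a toy frame. It is an assumption about how such a table could be built, not the library's code, and held_labels is a hypothetical helper.

```python
import pandas as pd

labels = ["top_net_loss", "top_total_wagered", "top_num_bets"]

# Toy labelled cohort: eight players, 0/1 flags per labelling scheme.
toy = pd.DataFrame({
    "player_id":         [1, 2, 3, 4, 5, 6, 7, 8],
    "top_net_loss":      [1, 1, 1, 1, 0, 0, 0, 0],
    "top_total_wagered": [0, 0, 1, 1, 1, 0, 0, 0],
    "top_num_bets":      [0, 0, 0, 1, 0, 1, 0, 0],
})


def held_labels(row):
    """Name the exact set of labels a player holds."""
    return " and ".join(name for name in labels if row[name] == 1)


# Count players per exact label combination, ignoring unlabelled players.
flagged = toy[toy[labels].sum(axis=1) > 0]
combo_counts = flagged.apply(held_labels, axis=1).value_counts()

# Share of the top_net_loss group that holds only that label.
group_size = int(toy["top_net_loss"].sum())
share_net_loss_only = round(100 * combo_counts["top_net_loss"] / group_size)
```

The same arithmetic matches the fixed-odds table: the top_net_loss row sums to 171 + 143 + 32 + 52 = 398 players, and 171 / 398 is roughly 43%, the bracketed figure in the first cell.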

...