Full replication

This notebook attempts to reproduce the two tables found in Braverman and Shaffer's 2010 paper on behavioural markers for high-risk internet gambling. To get started, download the data set titled 'How Do Gamblers Start Gambling: Identifying Behavioural Markers for High-risk Internet Gambling' through the link below - you'll need the text files under 'Raw Dataset 2' and 'Analytic Dataset';

File names: RawDataSet2_DailyAggregation.txt and AnalyticDataSet_HighRisk.txt

Flag The data description above implies RawDataSet2 contains actual betting data for players across the full duration of the study, when it appears to include at most 31 days of betting data per player. This means the AnalyticDataSet cannot be faithfully reproduced using the raw data alone, as the analytic data includes full-duration behavioural measures (see final cell).

Flag The trajectory measure calculated here disagrees with the analytic data set; specifically, it shows more extreme values for the gradient of the stakes. The reason for this difference is described below.

With the data downloaded, the first step is to import gamba - run the cell below to get started;

import gamba as gb

With gamba ready, we can load in both the analytic and raw data sets from the link above - the aim is to recreate the analytic data set from the raw data;

raw_data = gb.data.read_csv('RawDataSet2_DailyAggregation.txt', delimiter='\t', parse_dates=['TimeDATE'])
analytic_data = gb.data.read_csv('AnalyticDataSet_HighRisk.txt', delimiter='\t')
print('raw data loaded:', len(raw_data))
print('analytic data loaded:', len(analytic_data))
raw data loaded: 5161
analytic data loaded: 530

At this point, the data can be prepared for use in the gamba library. This can be done with the purpose-built prepare_braverman_data method in the gamba.data module;

all_player_bets = gb.data.prepare_braverman_data('RawDataSet2_DailyAggregation.txt')
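
For a sense of what this preparation involves, the sketch below does something similar with plain pandas - it loads the raw file, renames the identifier and date columns, and splits the rows into one dataframe per player. This is only an approximation of what prepare_braverman_data does, and its actual renaming and return value may differ;

import pandas as pd

# a rough pandas-only outline of this kind of preparation - the actual
# prepare_braverman_data method may rename more columns and return a
# different structure
raw = pd.read_csv('RawDataSet2_DailyAggregation.txt', delimiter='\t', parse_dates=['TimeDATE'])
raw = raw.rename(columns={'UserID': 'player_id', 'TimeDATE': 'bet_time'})

# split the daily-aggregated records into one dataframe per player
bets_by_player = {player_id: bets.sort_values('bet_time') for player_id, bets in raw.groupby('player_id')}
print('players prepared:', len(bets_by_player))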

Now for the start of the study's replication - we begin by calculating the measures reported in the paper, which include intensity, frequency, variability, trajectory, sum of stakes, total number of bets, average bet size, duration of account betting, and the net loss incurred for each player. These are all included in the calculate_braverman_measures method in the gamba.measures module;

measures = gb.measures.calculate_braverman_measures(all_player_bets) # this method saves them to a file called 'gamba_braverman_measures.csv'
measures.sort_values('player_id', inplace=True) # let's sort them by ID and display the first 3;
display(measures.head(3))
100%|██████████| 530/530 [00:07<00:00, 68.71it/s]
player_id intensity frequency variability trajectory sum_of_stakes total_num_bets mean_bet_size duration net_loss
364 1324368 6.160000 80.645161 104.616151 4.715685 2256.9700 154 14.655649 31 -354.4800
481 1324786 3.000000 43.333333 30.580785 3.617582 409.8700 39 10.509487 30 86.0700
484 1324808 4.285714 25.000000 5.516551 -0.917025 41.4455 30 1.381517 28 -4.9835
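
To make the definitions above concrete, here is a minimal sketch of how two of these measures might be computed by hand for a single player's daily-aggregated bets - intensity as the mean number of bets per active day in the first month, and trajectory as the slope of a linear fit of daily stake totals over time. This is an illustrative reading of the paper's definitions rather than gamba's internal implementation, and the 'bet_time', 'num_bets', and 'stakes' column names are assumptions;

import numpy as np

def braverman_style_measures(player_bets):
    # first month of daily-aggregated records for one player
    first_month = player_bets.sort_values('bet_time').head(31)

    # intensity: mean number of bets per active betting day in the first month
    intensity = first_month['num_bets'].mean()

    # trajectory: slope of a straight-line fit of daily stake totals against
    # the number of days since the player's first recorded bet
    days = (first_month['bet_time'] - first_month['bet_time'].min()).dt.days
    trajectory = np.polyfit(days, first_month['stakes'], 1)[0]
    return intensity, trajectory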

As a sanity check, we can display the original measures calculated for the three players above (after renaming the columns to more intuitive ones);

players = measures['player_id'].values[:3] # get only the first 3 values (those above)
display(analytic_data.head(3))
analytic_data['average_bet_size'] = analytic_data['p2sumstake'] / analytic_data['p2sumbet']
original_analysis = analytic_data[['UserID','p2bpd1m','p2totalactivedays1m','p2stakeSD1m','p2stakeSlope1m','p2sumstake','p2sumbet','average_bet_size','p2intvday','p2net']].copy() # copy so the renaming below doesn't act on a view
original_analysis.columns = ['player_id','intensity','frequency','variability','trajectory','sum_of_stakes','total_num_bets','average_bet_size','duration','net_loss']
original_analysis.sort_values('player_id', inplace=True) # after changing the column names, sort them by player ID (as above)
display(original_analysis.head(3))
UserID CountryID Gender ageRDATE Sereason p2sumstake p2sumbet p2sumday p2intvday p2bpd ... p2stakeSlope1m Zp2bpd1m Zp2stakeSD1m Zp2totalactivedays1m Zp2stakeslope1m random p2clusteringactivity p2clusterhalf1 p2clusterhalf2 average_bet_size
0 1324368 620 1 21 3 6202.5700 347 68 308 5.102941 ... 0.038441 0.367245 0.410294 3.456178 0.155683 0.512291 3 3 17.874841
1 1324786 40 1 19 1 665.8765 57 26 540 2.192308 ... 0.122799 -0.239092 -0.069475 1.344293 0.273148 0.410518 4 4 11.682044
2 1324808 616 1 20 2 843.2210 306 105 679 2.914286 ... -0.108439 0.007609 -0.231898 0.288350 -0.048842 0.384736 4 4 2.755624

3 rows × 24 columns

player_id intensity frequency variability trajectory sum_of_stakes total_num_bets average_bet_size duration net_loss
0 1324368 6.160000 25 104.616151 0.038441 6202.5700 347 17.874841 308 57.9500
1 1324786 3.000000 13 30.580785 0.122799 665.8765 57 11.682044 540 116.0765
2 1324808 4.285714 7 5.516551 -0.108439 843.2210 306 2.755624 679 -27.4005
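
To pin down exactly where the recomputed values drift from the originals, one option is to merge the two tables on player_id and look at the differences directly. The sketch below assumes the measures and original_analysis dataframes from the cells above are still in memory;

# merge the recomputed measures with the original analytic values on player ID,
# suffixing the shared columns so both versions sit side by side
comparison = measures.merge(original_analysis, on='player_id', suffixes=('_recomputed', '_original'))

# absolute differences for two of the measures that look inconsistent
comparison['duration_diff'] = (comparison['duration_recomputed'] - comparison['duration_original']).abs()
comparison['trajectory_diff'] = (comparison['trajectory_recomputed'] - comparison['trajectory_original']).abs()
display(comparison[['player_id', 'duration_diff', 'trajectory_diff']].head(3))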

This is a little puzzling: some of the measures align, yet others such as total_num_bets and duration appear to be underestimates compared to the original analysis, and the trajectory measure appears more extreme. To find out what's causing this difference, we can explore the duration of the data in the raw data set;

raw_data = gb.data.read_csv('RawDataSet2_DailyAggregation.txt', delimiter='\t', parse_dates=['TimeDATE'])

all_player_ids = set(list(raw_data['UserID']))
max_duration = 0
for player_id in all_player_ids:
    
    player_bets = raw_data[raw_data['UserID'] == player_id].copy()
    
    player_bets.rename(columns={'TimeDATE':'bet_time'}, inplace=True)
    duration = gb.measures.duration(player_bets)
    
    if duration > max_duration:
        max_duration = duration

print('unique players found:', len(all_player_ids))
print('maximum duration found:', max_duration)
unique players found: 530
maximum duration found: 31

The raw data contains a maximum of 31 days of betting data per player, so the analytic data set cannot be completely reproduced using the raw data alone. For this reason, the original analytic data will be taken forward rather than an exactly replicated data set.

Since we cannot compute the measures exactly, the next best thing is to verify the clustering described in the paper. We can do this using the k_means function from gamba's machine_learning module;

This next cell aims to recreate the k-means method described on page 3 of the paper, under the heading Statistical analysis;

standardised_measures_table = gb.measures.standardise_measures_table(original_analysis)

clustered_data = gb.machine_learning.k_means(standardised_measures_table, clusters=4, data_only=True)

gb.machine_learning.describe_clusters(clustered_data)
n=378 n=110 n=13 n=29
cluster_centroid
intensity -0.369099 0.993864 0.979338 0.602167
frequency -0.406867 0.978256 0.850219 1.211546
variability -0.255534 0.142271 4.409441 0.814460
trajectory -0.022931 0.133540 -0.076290 -0.173440
sum_of_stakes -0.305971 0.143319 0.914156 3.034755
total_num_bets -0.320316 0.487507 0.046568 2.305120
average_bet_size -0.159732 -0.068494 3.206291 0.904531
duration -0.219289 0.532655 -0.159865 0.909564
net_loss -0.275880 0.066042 0.406459 3.163248
cluster 0.000000 1.000000 2.000000 3.000000

The random initialisation of the k-means algorithm means that it is unlikely to exactly reproduce any previous k-means clusters on the data. This is a problem for exact replication, but we can be sure that the algorithm is being applied and that clusters are being identified, based on the descriptions above.

Note: this can be fixed by seeding probabilistic algorithms in the future!
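
As an illustration of seeding, the sketch below runs an equivalent clustering with scikit-learn's KMeans and a fixed random_state, so the cluster assignments are identical on every run. This is not gamba's internal implementation, just a demonstration of the idea, and it assumes the standardised_measures_table from the cell above is still in memory;

from sklearn.cluster import KMeans

# a reproducible alternative: scikit-learn's k-means with a fixed random_state,
# applied to the same standardised measures (dropping player_id if present)
features = standardised_measures_table.drop(columns=['player_id'], errors='ignore')
seeded_kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(features)
print(seeded_kmeans.labels_[:10]) # identical on every run with the same seed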

What to do with this information is the next question. As the attempt above to rebuild the analytic data from the raw data showed some discrepancies, it is impossible to exactly replicate this particular study (as is the case with other types of probabilistic analyses).