Load transaction data into a usable format


## Overview

The data module loads transaction data sets from multiple sources into a format that the gamba library can use. This format is needed to create a measures table, and is simply a dataframe with fixed column names. This page explains how loading works and finishes with some simple plotting. The module also contains a number of data loading functions for existing public data repositories, so you can replicate or extend work right from the start.

Data can be loaded using the read_csv method, which is a wrapper around the pandas method of the same name.

#### read_csv[source]

read_csv(file, parse_dates=[], index_col=None, delimiter=',', dummy_data=False)

Read csv files into a pandas dataframe.

#### concat[source]

concat(dfs, axis=0)

This method is a wrapper around the pandas concat function. See the pandas documentation for details.

#### add_tables[source]

add_tables(t1, t2, same_columns=False)

Joins two tables (the second to the right hand side of the first), adding '_2' to column names if same_columns parameter is True.
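To make the side-by-side join concrete, here is a minimal sketch in plain pandas. It is not the library's implementation: the helper name `join_side_by_side` is hypothetical, and the assumption is that "adding '_2'" means suffixing every column of the second table when both tables share column names.

```python
import pandas as pd


def join_side_by_side(t1, t2, same_columns=False):
    """Hypothetical helper: place t2 to the right of t1."""
    if same_columns:
        # Assumption: every column of the second table gets a '_2' suffix
        t2 = t2.rename(columns={name: f"{name}_2" for name in t2.columns})
    # Reset the indexes so rows are aligned by position, not by label
    return pd.concat(
        [t1.reset_index(drop=True), t2.reset_index(drop=True)], axis=1
    )


left = pd.DataFrame({"player_id": ["a", "b"], "bet_size": [1.0, 2.0]})
right = pd.DataFrame({"player_id": ["a", "b"], "bet_size": [3.0, 4.0]})

joined = join_side_by_side(left, right, same_columns=True)
print(list(joined.columns))
```

Suffixing the second table keeps every column name unique, which matters because later lookups by column name would otherwise be ambiguous.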

The read_csv method can be used to read a regular CSV file as you'd expect, but can also be used for tab-separated files, as some of the Transparency Project's data sets are, by changing the delimiter parameter.
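As an illustration of reading tab-separated data, the sketch below calls pandas directly rather than the gamba wrapper. The in-memory file and its column names (UserID, Date, Stake, Winnings) are invented for the example and do not come from any real data set.

```python
import io

import pandas as pd

# A tiny tab-separated file held in memory (hypothetical column names)
raw_text = (
    "UserID\tDate\tStake\tWinnings\n"
    "1\t2008-02-01\t5.0\t0.0\n"
    "1\t2008-02-02\t2.5\t6.0\n"
    "2\t2008-02-01\t10.0\t0.0\n"
)

# delimiter='\t' switches the parser from commas to tabs;
# parse_dates converts the named column to datetimes
df = pd.read_csv(io.StringIO(raw_text), delimiter="\t", parse_dates=["Date"])
print(df.dtypes)
```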

## Setting Column Names

The gamba library's methods expect a dataframe with certain column names. The most important task after loading data as a pandas dataframe is to set these column names according to the type of data each column contains. For basic data, the names should match the following table:

| Column Name | Description |
| --- | --- |
| player_id | a unique identifier for each player |
| bet_size | the size of the bet (in raw currency form, e.g. USD) |
| bet_time | the datetime the bet was placed |
| payout_size | the size of the payout (also in raw currency) |

Advanced data sets may contain more information about each bet. These additional columns can be included using names from the table below. Note that methods in other parts of the library will reject dataframes which contain column names not in one of these two tables.

| Column Name | Description |
| --- | --- |
| payout_time | the timestamp that the payout was paid |
| decimal_odds | the decimal odds for the given bet |
| house_edge | the percentage taken by the house (value of 3 for 3% house edge) |
| game_type | the game being played as a string, e.g. 'coinflip', 'roulette'; it doesn't have to be one of a fixed set but should be unique per game type |
| provider | the operator's name, which is useful for mixed-operator datasets |
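The renaming step can be sketched with the pandas rename method. The source column names here (UserID, Date, Stake, Winnings, Game) are hypothetical; only the target names come from the two tables above.

```python
import pandas as pd

# Raw data as it might arrive from a provider (hypothetical column names)
raw = pd.DataFrame(
    {
        "UserID": [1, 1, 2],
        "Date": pd.to_datetime(["2008-02-01", "2008-02-02", "2008-02-01"]),
        "Stake": [5.0, 2.5, 10.0],
        "Winnings": [0.0, 6.0, 0.0],
        "Game": ["roulette", "roulette", "coinflip"],
    }
)

# Map each source column onto the name the gamba library expects
column_map = {
    "UserID": "player_id",
    "Date": "bet_time",
    "Stake": "bet_size",
    "Winnings": "payout_size",
    "Game": "game_type",
}

bets = raw.rename(columns=column_map)
print(list(bets.columns))
```

Keeping the mapping in a dictionary makes it easy to reuse for every file from the same source.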

Several public repositories provide transaction data that can be loaded by the gamba library (see Public Repositories in the menu). The data module contains methods for loading some of these sets into the correct format, which are used in the respective replications. If you're loading in a similar data set, feel free to explore the source code of these methods to see how it's done, and modify them for your own needs!

#### prepare_labrie_data[source]

prepare_labrie_data(filename, savedir='labrie_individuals/', loud=False, year=2008)

Splits the original LaBrie data into CSV files for each individual's transactions and renames the columns to be compatible with the rest of the gamba library.
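The per-individual split can be pictured as a pandas groupby that writes one file per player. This is a sketch of the general idea only, not the body of prepare_labrie_data; the file naming scheme and the temporary output directory are assumptions made for the example.

```python
import os
import tempfile

import pandas as pd

bets = pd.DataFrame(
    {
        "player_id": [1, 1, 2],
        "bet_time": pd.to_datetime(["2008-02-01", "2008-02-02", "2008-02-01"]),
        "bet_size": [5.0, 2.5, 10.0],
        "payout_size": [0.0, 6.0, 0.0],
    }
)

# Write into a throwaway directory so the example leaves no files behind
savedir = tempfile.mkdtemp()

# One CSV per player, named after the player's identifier (assumed scheme)
for player_id, player_bets in bets.groupby("player_id"):
    player_bets.to_csv(os.path.join(savedir, f"{player_id}.csv"), index=False)

written = sorted(os.listdir(savedir))
print(written)
```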

#### prepare_braverman_data[source]

prepare_braverman_data(filename, loud=False)

Splits the original Braverman and Shaffer data into CSV files for each individual's transactions, and renames the columns to be compatible with the rest of the gamba library.

#### prepare_philander_data[source]

prepare_philander_data(filename, loud=False)

Loads in the analytic data set of high-risk internet gamblers and removes the UserID, Sereason, random, and clustering columns as described in Philander's 2014 study.

There's also a generate_bets method which returns a set of entirely synthetic transactions to use as an example throughout the docs or to compare against your own data.

#### generate_bets[source]

generate_bets(n=200)

## Final Checks

It's good practice to check that your column names match those used by the gamba library, and to make sure that no extra columns exist. The check_data method below can be given the dataframe, and it will raise an error if anything isn't as it should be.

#### check_data[source]

check_data(dataframe)

Make sure that your data is in the gamba standard format (has the right column names). This method will raise an exception if the format is incorrect, and will do nothing if it is correct.
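For intuition, a check of this kind could look like the sketch below. It is not the source of check_data: the function name `check_columns` is hypothetical, and it assumes that all four basic columns are required and that a ValueError is the exception raised.

```python
import pandas as pd

# Column names taken from the two tables in Setting Column Names
BASIC_COLUMNS = {"player_id", "bet_size", "bet_time", "payout_size"}
ADVANCED_COLUMNS = {
    "payout_time", "decimal_odds", "house_edge", "game_type", "provider",
}


def check_columns(dataframe):
    """Hypothetical check: raise ValueError on missing or unknown columns."""
    columns = set(dataframe.columns)
    missing = BASIC_COLUMNS - columns  # assumption: all basic columns are required
    unexpected = columns - BASIC_COLUMNS - ADVANCED_COLUMNS
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    if unexpected:
        raise ValueError(f"unexpected columns: {sorted(unexpected)}")


good = pd.DataFrame(
    {
        "player_id": [1],
        "bet_size": [5.0],
        "bet_time": pd.to_datetime(["2008-02-01"]),
        "payout_size": [0.0],
    }
)
check_columns(good)  # correct format, so nothing happens

bad = good.assign(session_id=7)  # a column that is in neither table
try:
    check_columns(bad)
    rejected = False
except ValueError:
    rejected = True
```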

#### summarise_app[source]

summarise_app(player_bets, dapp=False)

Prints out some basic information about a gambling or gambling-like application given a collection of bets made through that application. Data displayed includes the number of users, the number of game types provided, the number of bets placed, the total value of the bets and the payouts, the time of the first bet, and the time of the last.
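The quantities listed above map onto simple pandas aggregations. The sketch below computes them for a small made-up set of bets; it assumes a single dataframe in the standard column format and is not the implementation of summarise_app.

```python
import pandas as pd

bets = pd.DataFrame(
    {
        "player_id": [1, 1, 2],
        "bet_time": pd.to_datetime(["2008-02-01", "2008-02-02", "2008-02-03"]),
        "bet_size": [5.0, 2.5, 10.0],
        "payout_size": [0.0, 6.0, 0.0],
        "game_type": ["roulette", "roulette", "coinflip"],
    }
)

summary = {
    "users": bets["player_id"].nunique(),
    "game_types": bets["game_type"].nunique(),
    "bets": len(bets),
    "total_bet_value": bets["bet_size"].sum(),
    "total_payout_value": bets["payout_size"].sum(),
    "first_bet": bets["bet_time"].min(),
    "last_bet": bets["bet_time"].max(),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```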

#### summarise_providers[source]

summarise_providers(player_bets, providers, game_types=['coinflip', 'onedice', 'twodice', 'roll'])

Create a table containing summary data for providers using a collection of player bets. Summary includes the number of unique users and games, the total value of bets and payouts, the starting and ending block numbers, and the time the starting and ending blocks occurred.

## Plotting

The data module contains some basic visualisation methods which can be applied before any behavioural measures are calculated. This is useful for showing the distributions of player bet sizes, times, payouts, and so on.

#### plot_player_career[source]

plot_player_career(player_df, savename=None)

Creates a candlestick-style plot of a player's betting activity over the course of their career. This works best on regularly-spaced sequential data but can also provide insight into intra-session win/loss patterns.

#### plot_player_career_split[source]

plot_player_career_split(player_df)

Plot a player's betting and payout trajectory on a single plot, with green indicating payouts (top) and red indicating bets (bottom). A cumulative value line is also plotted between the two. Note that the player_df must include both bet_size and payout_size columns.
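One plausible reading of the cumulative value line is a running total of payouts minus bets, ordered by bet time. The sketch below computes that series without plotting it; the definition is an assumption for illustration, not taken from the plotting code.

```python
import pandas as pd

player_df = pd.DataFrame(
    {
        "bet_time": pd.to_datetime(["2008-02-01", "2008-02-02", "2008-02-03"]),
        "bet_size": [5.0, 2.5, 10.0],
        "payout_size": [0.0, 6.0, 0.0],
    }
)

# Order bets chronologically before accumulating
ordered = player_df.sort_values("bet_time")

# Assumed definition: net result of each bet, then a running total
net_per_bet = ordered["payout_size"] - ordered["bet_size"]
cumulative_value = net_per_bet.cumsum()

print(cumulative_value.tolist())
```

This also shows why both bet_size and payout_size are needed: without either column the net result of a bet cannot be computed.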

#### plot_provider_dates[source]

plot_provider_dates(player_bets, providers, provider_labels=None)

Visualises the start and end dates of bets from one or more providers on a stacked Gantt-style plot.

#### plot_provider_bets[source]

plot_provider_bets(player_bets, providers, provider_labels=None)

Visualises the cumulative number of bets from one or more providers.