Relational data structure

The primary data structure of the aindo.rdml package is the RelationalData class, which encapsulates a relational dataset consisting of one or more tables. A RelationalData object combines a dictionary of pandas.DataFrame objects, which contain the data tables, with a Schema object that describes:

The types of the table columns.
The relational structure of the dataset, including primary and foreign keys.

The RelationalData class, together with the Schema class and other necessary classes for building one, can be found in the aindo.rdml.relational module.

The `Schema` class

A Schema object is a collection of named Table objects, together with the information about the relations among the tables. Each Table object lists the columns of interest of that table. There are two primary types of columns:

PrimaryKey and ForeignKey columns, which define the relational structure of the data.
Feature columns (Column), namely columns that are not keys.

Feature columns

When building a Schema, each feature column is assigned a Column type. This type tells the various routines of the library how to treat the data in that column. For example, a Column.CATEGORICAL column is preprocessed differently from a Column.INTEGER column before being fed to the generative model during training (more info in the ColumnPreproc section). It also appears differently in the evaluation report (more info in the Synthetic data report section).

The available Column types are: BOOLEAN, CATEGORICAL, NUMERIC, INTEGER, DATE, TIME, DATETIME, COORDINATES, ITAFISCALCODE, and TEXT.
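
Choosing a type for each column is a manual step, but the pandas dtypes can give a first guess. The sketch below is only an illustration: the helper suggest_type_name is hypothetical and not part of aindo.rdml, it returns plain strings that mirror some of the type names listed above, and the toy DataFrame is made up. The suggestions still need a human review, since a string column could be CATEGORICAL or TEXT, for instance.

```python
import pandas as pd
from pandas.api import types as pdt


def suggest_type_name(series: pd.Series) -> str:
    """Suggest a column type name (a plain string) from the pandas dtype."""
    if pdt.is_bool_dtype(series):
        return "BOOLEAN"
    if pdt.is_integer_dtype(series):
        return "INTEGER"
    if pdt.is_float_dtype(series):
        return "NUMERIC"
    if pdt.is_datetime64_any_dtype(series):
        return "DATETIME"
    # Anything else (typically strings) defaults to CATEGORICAL in this sketch.
    return "CATEGORICAL"


# Made-up toy table, only to exercise the helper.
df = pd.DataFrame(
    {
        "height": [80.0, 76.5],
        "GP": [82, 60],
        "pos": ["G", "F"],
        "is_all_star": [True, False],
        "born": pd.to_datetime(["1960-01-01", "1962-05-03"]),
    }
)

suggestions = {name: suggest_type_name(df[name]) for name in df.columns}
print(suggestions)
```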

Building a `Schema`

To illustrate how to build a Schema from scratch, let us work with (a subset of) the BasketballMen dataset, which consists of the following tables:

players: The root table with the primary key playerID.
season: A child table of players linked via the foreign key playerID.
all_star: Another child table of players connected by the foreign key playerID.

Let us load the tables with pandas and gather the resulting pandas.DataFrame objects into a dictionary:

import pandas as pd

df_players = pd.read_csv("path/to/basket/dir/players.csv")
df_season = pd.read_csv("path/to/basket/dir/season.csv")
df_all_star = pd.read_csv("path/to/basket/dir/all_star.csv")

dfs = {
    "players": df_players,
    "season": df_season,
    "all_star": df_all_star,
}
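
Before declaring keys, it can be worth checking that the loaded tables actually satisfy the relational structure described above: the primary key should be unique, and every foreign key value should point to an existing parent row. The following is a minimal sketch with plain pandas. The small DataFrames are made-up stand-ins for the CSV files, and orphan_keys is a hypothetical helper, not an aindo.rdml function.

```python
import pandas as pd

# Made-up miniature versions of the three tables.
df_players = pd.DataFrame({"playerID": ["p1", "p2", "p3"], "pos": ["G", "F", "C"]})
df_season = pd.DataFrame({"playerID": ["p1", "p1", "p2"], "year": [1990, 1991, 1990]})
df_all_star = pd.DataFrame({"playerID": ["p1", "p4"], "points": [12, 7]})


def orphan_keys(child: pd.DataFrame, parent: pd.DataFrame, key: str) -> list:
    """Return the sorted child key values that have no matching parent row."""
    missing = ~child[key].isin(parent[key])
    return sorted(child.loc[missing, key].unique().tolist())


# A primary key must identify each row of the root table uniquely.
primary_key_is_unique = bool(df_players["playerID"].is_unique)

# Foreign key values should all exist in the parent table.
season_orphans = orphan_keys(df_season, df_players, "playerID")
all_star_orphans = orphan_keys(df_all_star, df_players, "playerID")

print(primary_key_is_unique)  # True
print(season_orphans)         # []
print(all_star_orphans)       # ['p4']
```

In this toy data the all_star table refers to a player "p4" that is missing from players, which is the kind of inconsistency such a check is meant to surface.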

To build a Schema, import the Column, PrimaryKey, ForeignKey, Table, and Schema objects from the aindo.rdml.relational module. Tables and columns that are present in the data but not included in the Schema are ignored.

from aindo.rdml.relational import Column, ForeignKey, PrimaryKey, Schema, Table

schema = Schema(
    players=Table(
        playerID=PrimaryKey(),
        pos=Column.CATEGORICAL,
        height=Column.NUMERIC,
        weight=Column.NUMERIC,
        college=Column.CATEGORICAL,
        race=Column.CATEGORICAL,
        birthCity=Column.CATEGORICAL,
        birthState=Column.CATEGORICAL,
        birthCountry=Column.CATEGORICAL,
    ),
    season=Table(
        playerID=ForeignKey(parent="players"),
        year=Column.INTEGER,
        stint=Column.INTEGER,
        tmID=Column.CATEGORICAL,
        lgID=Column.CATEGORICAL,
        GP=Column.INTEGER,
        points=Column.INTEGER,
        GS=Column.INTEGER,
        assists=Column.INTEGER,
        steals=Column.INTEGER,
        minutes=Column.INTEGER,
    ),
    all_star=Table(
        playerID=ForeignKey(parent="players"),
        conference=Column.CATEGORICAL,
        league_id=Column.CATEGORICAL,
        points=Column.INTEGER,
        rebounds=Column.INTEGER,
        assists=Column.INTEGER,
        blocks=Column.INTEGER,
    ),
)
print(schema)

Out:
Schema:
players:Table
Primary key: playerID
Feature columns:
  pos:<Column.CATEGORICAL: 'Categorical'>
  height:<Column.NUMERIC: 'Numeric'>
  weight:<Column.NUMERIC: 'Numeric'>
  college:<Column.CATEGORICAL: 'Categorical'>
  race:<Column.CATEGORICAL: 'Categorical'>
  birthCity:<Column.CATEGORICAL: 'Categorical'>
  birthState:<Column.CATEGORICAL: 'Categorical'>
  birthCountry:<Column.CATEGORICAL: 'Categorical'>
Foreign keys:

season:Table
Primary key: None
Feature columns:
  year:<Column.INTEGER: 'Integer'>
  stint:<Column.INTEGER: 'Integer'>
  tmID:<Column.CATEGORICAL: 'Categorical'>
  lgID:<Column.CATEGORICAL: 'Categorical'>
  GP:<Column.INTEGER: 'Integer'>
  points:<Column.INTEGER: 'Integer'>
  GS:<Column.INTEGER: 'Integer'>
  assists:<Column.INTEGER: 'Integer'>
  steals:<Column.INTEGER: 'Integer'>
  minutes:<Column.INTEGER: 'Integer'>
Foreign keys:
  playerID:ForeignKey(parent=players)
all_star:Table
Primary key: None
Feature columns:
  conference:<Column.CATEGORICAL: 'Categorical'>
  league_id:<Column.CATEGORICAL: 'Categorical'>
  points:<Column.INTEGER: 'Integer'>
  rebounds:<Column.INTEGER: 'Integer'>
  assists:<Column.INTEGER: 'Integer'>
  blocks:<Column.INTEGER: 'Integer'>
Foreign keys:
  playerID:ForeignKey(parent=players)
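
Since tables and columns that are absent from the Schema are ignored, it may help to list what would be left out before going further. The sketch below does this with plain Python sets. It assumes a hypothetical dictionary, declared, that repeats the column names written in the Schema; it does not inspect a Schema object, and the toy tables (including the coaches table and the nickname column) are made up.

```python
import pandas as pd

# Made-up data: one extra column and one extra table not declared below.
dfs = {
    "players": pd.DataFrame({"playerID": ["p1"], "pos": ["G"], "nickname": ["Ace"]}),
    "coaches": pd.DataFrame({"coachID": ["c1"]}),
}

# Hypothetical stand-in for the column names declared in the Schema.
declared = {"players": {"playerID", "pos"}}


def ignored_items(dfs: dict, declared: dict) -> tuple:
    """Return (tables, columns per table) present in the data but not declared."""
    ignored_tables = sorted(set(dfs) - set(declared))
    ignored_columns = {
        name: sorted(set(dfs[name].columns) - columns)
        for name, columns in declared.items()
        if name in dfs
    }
    return ignored_tables, ignored_columns


ignored_tables, ignored_columns = ignored_items(dfs, declared)
print(ignored_tables)   # ['coaches']
print(ignored_columns)  # {'players': ['nickname']}
```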

The `RelationalData` class

A RelationalData object is defined by combining the loaded data and a Schema object:

from aindo.rdml.relational import RelationalData, Schema

dfs = {
    "players": ...,
    "season": ...,
    "all_star": ...,
}
schema = Schema(...)
data = RelationalData(data=dfs, schema=schema)

The RelationalData.split() method splits the data into train, test and, optionally, validation sets, while preserving the consistency of the relational data structure.

from aindo.rdml.relational import RelationalData

data = RelationalData(data=..., schema=...)
data_train_valid, data_test = data.split(ratio=0.1)
data_train, data_valid = data_train_valid.split(ratio=0.1)
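
To give an idea of what a split that respects the relational structure involves, here is a minimal pandas sketch. It is not the implementation of RelationalData.split(): the function split_by_root, its parameters, and the toy data are all hypothetical, and it assumes that every table carries the key column and that ratio is the fraction of root rows sent to the second set. The idea is to split the rows of the root table at random and then send each child row to the same side as its parent.

```python
import pandas as pd

# Made-up toy data: 10 players, each with 2 rows in the season table.
dfs = {
    "players": pd.DataFrame({"playerID": [f"p{i}" for i in range(10)]}),
    "season": pd.DataFrame(
        {
            "playerID": [f"p{i % 10}" for i in range(20)],
            "points": list(range(20)),
        }
    ),
}


def split_by_root(dfs: dict, root: str, key: str, ratio: float, seed: int = 0) -> tuple:
    """Split the root rows at random; child rows follow their parent."""
    second_root = dfs[root].sample(frac=ratio, random_state=seed)
    second_ids = set(second_root[key])
    first, second = {}, {}
    for name, df in dfs.items():
        # Rows whose key belongs to a sampled root row go to the second set.
        in_second = df[key].isin(second_ids)
        first[name] = df[~in_second]
        second[name] = df[in_second]
    return first, second


first, second = split_by_root(dfs, root="players", key="playerID", ratio=0.2)
parent_ids = set(second["players"]["playerID"])
child_ids = set(second["season"]["playerID"])

print(len(first["players"]), len(second["players"]))  # 8 2
print(child_ids <= parent_ids)                        # True
```

Under the same reading of ratio (an assumption, not something stated above), two successive splits with ratio=0.1 would leave roughly 81% of the root rows for training, 9% for validation, and 10% for testing.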
