Relational data structure
The primary data structure of the aindo.rdml package is the RelationalData class, which encapsulates a relational dataset consisting of one or more tables.
A RelationalData object is a dictionary of pandas.DataFrame objects that contain the data tables, along with a Schema object that describes:
- The types of the table columns.
- The relational structure of the dataset, including primary and foreign keys.
The RelationalData class, together with the Schema class and other necessary classes for building one, can be found in the aindo.rdml.relational module.
The Schema class
A Schema object is a collection of named Table objects, also containing the information about relations among tables.
Each Table object contains the columns of interest of that table.
There are two primary types of columns:
- PrimaryKey's and ForeignKey's define the relational structure of the data.
- Feature Column's, namely columns that are not keys.
Feature columns
When building a Schema, each feature column is associated with a Column type.
The associated type instructs the various routines of the library on how to treat the data in the column.
For example, a Column.CATEGORICAL column will be preprocessed differently than a Column.INTEGER column before being fed to the generative model during training (more info in the ColumnPreproc section).
It will also appear differently in the evaluation report (more info in the Synthetic data report section).
The available Column types are: BOOLEAN, CATEGORICAL, NUMERIC, INTEGER, DATE, TIME, DATETIME, COORDINATES, ITAFISCALCODE, and TEXT.
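When a table has many columns, a first guess of the Column type for each one can be derived from the pandas dtypes. The following is only a heuristic sketch in plain pandas: the helper suggest_column_type is hypothetical and not part of aindo.rdml, it returns the names of the Column types as strings, and the toy values are made up. The suggestions should always be reviewed by hand.

```python
import pandas as pd
from pandas.api import types as ptypes


def suggest_column_type(series: pd.Series) -> str:
    """Suggest the name of a Column type from a pandas dtype (heuristic only)."""
    if ptypes.is_bool_dtype(series):
        return "BOOLEAN"
    if ptypes.is_integer_dtype(series):
        return "INTEGER"
    if ptypes.is_float_dtype(series):
        return "NUMERIC"
    if ptypes.is_datetime64_any_dtype(series):
        return "DATETIME"
    # Strings and mixed objects default to CATEGORICAL. Free text, dates stored
    # as strings, coordinates and similar cases must be recognised by the user.
    return "CATEGORICAL"


# Toy data, made up for illustration.
df = pd.DataFrame(
    {
        "pos": ["G", "F", "C"],
        "height": [75.0, 80.5, 84.0],
        "GP": [82, 60, 71],
    }
)
suggestions = {name: suggest_column_type(df[name]) for name in df.columns}
print(suggestions)
# {'pos': 'CATEGORICAL', 'height': 'NUMERIC', 'GP': 'INTEGER'}
```

Note that pandas reads an integer column containing missing values as floating point by default, so such a column would be suggested as NUMERIC even when INTEGER is the better choice.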
Building a Schema
To illustrate how to build a Schema from scratch, let us work with (a subset of) the BasketballMen dataset, which consists of the following tables:
- players: The root table with the primary key playerID.
- season: A child table of players linked via the foreign key playerID.
- all_star: Another child table of players connected by the foreign key playerID.
Let us load the tables with pandas and gather the pandas.DataFrame's into a dictionary:
import pandas as pd
df_players = pd.read_csv("path/to/basket/dir/players.csv")
df_season = pd.read_csv("path/to/basket/dir/season.csv")
df_all_star = pd.read_csv("path/to/basket/dir/all_star.csv")
dfs = {
    "players": df_players,
    "season": df_season,
    "all_star": df_all_star,
}
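Before declaring the keys, it can be useful to check that the loaded tables are consistent with the relational structure described above: the primary key should be unique, and every foreign key value should point to an existing parent row. The following is a minimal sketch in plain pandas, not part of aindo.rdml. The helper name check_keys is hypothetical, and the tiny DataFrames contain made-up values standing in for the real CSV files.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are made up for illustration).
df_players = pd.DataFrame({"playerID": ["p1", "p2", "p3"]})
df_season = pd.DataFrame({"playerID": ["p1", "p1", "p2"], "year": [1990, 1991, 1990]})
df_all_star = pd.DataFrame({"playerID": ["p2", "p4"]})  # "p4" has no parent row


def check_keys(parent: pd.DataFrame, child: pd.DataFrame, key: str) -> pd.Series:
    """Return the foreign key values in `child` that have no match in `parent`."""
    if not parent[key].is_unique:
        raise ValueError(f"Primary key {key!r} contains duplicates")
    orphans = ~child[key].isin(parent[key])
    return child.loc[orphans, key]


orphans_season = check_keys(df_players, df_season, "playerID")
orphans_all_star = check_keys(df_players, df_all_star, "playerID")
print(orphans_season.tolist())    # []
print(orphans_all_star.tolist())  # ['p4']
```

An empty result for every child table means that each child row can be linked to a row of players.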
To build a Schema, users must import the Column, PrimaryKey, ForeignKey, Table, and Schema objects from the aindo.rdml.relational module.
Tables and columns that are present in the data but that are not included in the Schema will be ignored.
from aindo.rdml.relational import Column, ForeignKey, PrimaryKey, Schema, Table
schema = Schema(
    players=Table(
        playerID=PrimaryKey(),
        pos=Column.CATEGORICAL,
        height=Column.NUMERIC,
        weight=Column.NUMERIC,
        college=Column.CATEGORICAL,
        race=Column.CATEGORICAL,
        birthCity=Column.CATEGORICAL,
        birthState=Column.CATEGORICAL,
        birthCountry=Column.CATEGORICAL,
    ),
    season=Table(
        playerID=ForeignKey(parent="players"),
        year=Column.INTEGER,
        stint=Column.INTEGER,
        tmID=Column.CATEGORICAL,
        lgID=Column.CATEGORICAL,
        GP=Column.INTEGER,
        points=Column.INTEGER,
        GS=Column.INTEGER,
        assists=Column.INTEGER,
        steals=Column.INTEGER,
        minutes=Column.INTEGER,
    ),
    all_star=Table(
        playerID=ForeignKey(parent="players"),
        conference=Column.CATEGORICAL,
        league_id=Column.CATEGORICAL,
        points=Column.INTEGER,
        rebounds=Column.INTEGER,
        assists=Column.INTEGER,
        blocks=Column.INTEGER,
    ),
)
print(schema)
Out:
Schema:
players:Table
Primary key: playerID
Feature columns:
pos:<Column.CATEGORICAL: 'Categorical'>
height:<Column.NUMERIC: 'Numeric'>
weight:<Column.NUMERIC: 'Numeric'>
college:<Column.CATEGORICAL: 'Categorical'>
race:<Column.CATEGORICAL: 'Categorical'>
birthCity:<Column.CATEGORICAL: 'Categorical'>
birthState:<Column.CATEGORICAL: 'Categorical'>
birthCountry:<Column.CATEGORICAL: 'Categorical'>
Foreign keys:
season:Table
Primary key: None
Feature columns:
year:<Column.INTEGER: 'Integer'>
stint:<Column.INTEGER: 'Integer'>
tmID:<Column.CATEGORICAL: 'Categorical'>
lgID:<Column.CATEGORICAL: 'Categorical'>
GP:<Column.INTEGER: 'Integer'>
points:<Column.INTEGER: 'Integer'>
GS:<Column.INTEGER: 'Integer'>
assists:<Column.INTEGER: 'Integer'>
steals:<Column.INTEGER: 'Integer'>
minutes:<Column.INTEGER: 'Integer'>
Foreign keys:
playerID:ForeignKey(parent=players)
all_star:Table
Primary key: None
Feature columns:
conference:<Column.CATEGORICAL: 'Categorical'>
league_id:<Column.CATEGORICAL: 'Categorical'>
points:<Column.INTEGER: 'Integer'>
rebounds:<Column.INTEGER: 'Integer'>
assists:<Column.INTEGER: 'Integer'>
blocks:<Column.INTEGER: 'Integer'>
Foreign keys:
playerID:ForeignKey(parent=players)
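Since tables and columns that are missing from the Schema are ignored, it may help to list what would be left out before going further. The sketch below is plain Python and does not use the aindo.rdml API: the column names of the players table are copied by hand from the Schema above, while the data columns firstName and lastName and the teams table are hypothetical examples of content found in the loaded data.

```python
# Column names declared in the Schema, copied by hand (players table only).
schema_columns = {
    "players": [
        "playerID", "pos", "height", "weight", "college",
        "race", "birthCity", "birthState", "birthCountry",
    ],
}

# Hypothetical column names found in the loaded data.
data_columns = {
    "players": [
        "playerID", "firstName", "lastName", "pos", "height", "weight",
        "college", "race", "birthCity", "birthState", "birthCountry",
    ],
    "teams": ["tmID", "name"],
}


def ignored_items(data_columns, schema_columns):
    """Return the tables, and the columns per table, that the schema leaves out."""
    ignored_tables = [t for t in data_columns if t not in schema_columns]
    ignored_cols = {
        t: [c for c in cols if c not in schema_columns[t]]
        for t, cols in data_columns.items()
        if t in schema_columns
    }
    return ignored_tables, ignored_cols


ignored_tables, ignored_cols = ignored_items(data_columns, schema_columns)
print(ignored_tables)  # ['teams']
print(ignored_cols)    # {'players': ['firstName', 'lastName']}
```

With real data, the data_columns dictionary could be built from the loaded tables as {name: list(df.columns) for name, df in dfs.items()}.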
The RelationalData class
A RelationalData object is defined by combining the loaded data and a Schema object:
from aindo.rdml.relational import RelationalData, Schema
dfs = {
    "players": ...,
    "season": ...,
    "all_star": ...,
}
schema = Schema(...)
data = RelationalData(data=dfs, schema=schema)
The RelationalData.split() method splits the data into train, test and possibly validation sets, while respecting the consistency of the relational data structure.
from aindo.rdml.relational import RelationalData
data = RelationalData(data=..., schema=...)
data_train_valid, data_test = data.split(ratio=0.1)
data_train, data_valid = data_train_valid.split(ratio=0.1)
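To give an intuition of what a split that respects the relational structure means, the following conceptual sketch splits the rows of the root table and then sends each child row to the same side as its parent. It is written in plain Python on toy records and is not the implementation used by RelationalData.split(). The helper split_by_root is hypothetical, and ratio is assumed here to be the fraction of root rows assigned to the second split.

```python
import random


def split_by_root(tables, root, key, children, ratio, seed=0):
    """Split toy tables so that every child row follows its parent row.

    tables: dict mapping a table name to a list of row dicts.
    ratio: fraction of root rows sent to the second split (assumed meaning).
    """
    rng = random.Random(seed)
    root_ids = [row[key] for row in tables[root]]
    rng.shuffle(root_ids)
    n_second = round(len(root_ids) * ratio)
    second_ids = set(root_ids[:n_second])

    first = {name: [] for name in tables}
    second = {name: [] for name in tables}
    for name in [root, *children]:
        for row in tables[name]:
            # A row goes wherever the root row with the same key went.
            target = second if row[key] in second_ids else first
            target[name].append(row)
    return first, second


# Toy data: 10 players, each with 2 season rows (made up for illustration).
tables = {
    "players": [{"playerID": f"p{i}"} for i in range(10)],
    "season": [
        {"playerID": f"p{i}", "year": year}
        for i in range(10)
        for year in (1990, 1991)
    ],
}
train, test = split_by_root(
    tables, root="players", key="playerID", children=["season"], ratio=0.2
)
print(len(train["players"]), len(test["players"]))  # 8 2
print(len(train["season"]), len(test["season"]))    # 16 4
```

No season row ends up in a split that lacks the corresponding player, which is the kind of consistency a naive row-wise split of each table would not guarantee.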