Neural models
The aindo.rdml.synth module allows you to:
- Preprocess columns within each table.
- Train generative models on relational tabular data.
- Generate synthetic data.
- Perform predictive tasks on relational tabular data.
To illustrate the full process, from preprocessing to synthetic data generation, in the following sections we will use the Airbnb Open Data dataset. The original dataset consists of a single table, but upon further inspection, it is clear that we can rearrange it in a more "natural" form, by splitting it into two tables:
- A table host, with primary key host_id.
- A table listings, with primary key id and foreign key host_id, referring to the primary key of host.
Both tables have a text column: host_name in host, and name in listings.
For simplicity, we will focus here on the latter; however, the same operations can be performed on the host table too.
We can define the Schema and load the data as follows:
import pandas as pd

from aindo.rdml.relational import Schema, Table, Column, PrimaryKey, ForeignKey, RelationalData

# Declare the two tables, their keys and the type of each column.
schema = Schema(
    host=Table(
        host_id=PrimaryKey(),
        host_name=Column.TEXT,
        calculated_host_listings_count=Column.NUMERIC,
    ),
    listings=Table(
        id=PrimaryKey(),
        host_id=ForeignKey(parent="host"),
        name=Column.TEXT,
        neighbourhood_group=Column.CATEGORICAL,
        neighbourhood=Column.CATEGORICAL,
        latitude=Column.NUMERIC,
        longitude=Column.NUMERIC,
        room_type=Column.CATEGORICAL,
        price=Column.INTEGER,
        minimum_nights=Column.INTEGER,
        number_of_reviews=Column.INTEGER,
        last_review=Column.DATETIME,
        reviews_per_month=Column.NUMERIC,
        availability_365=Column.INTEGER,
    ),
)

# Load the original single-table dataset.
df = pd.read_csv("path/to/airbnb.csv")

# Split it into the two tables of the schema. The original data has one row
# per listing, so host rows are repeated: keep a single copy of each.
dfs = {
    "host": df.loc[:, list(schema.tables["host"].all_columns)].drop_duplicates(),
    "listings": df.loc[:, list(schema.tables["listings"].all_columns)],
}
data = RelationalData(data=dfs, schema=schema)
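Before handing the split tables to the library, it can be useful to check that they really satisfy the key constraints declared in the schema. The following is a minimal sketch of such a check using pandas only. The toy DataFrame and its values are made up for illustration and stand in for the real Airbnb file, and the column lists are a shortened version of the ones in the schema above.

```python
import pandas as pd

# Toy stand-in for the single Airbnb table (values are made up for illustration).
df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "host_id": [10, 10, 20],
        "host_name": ["Ann", "Ann", "Bob"],
        "calculated_host_listings_count": [2, 2, 1],
        "name": ["Cozy loft", "Sunny studio", "Quiet room"],
        "price": [120, 90, 60],
    }
)

# Shortened column lists, mirroring the host and listings tables of the schema.
host_columns = ["host_id", "host_name", "calculated_host_listings_count"]
listings_columns = ["id", "host_id", "name", "price"]

# Same split as in the example above: one row per host, one row per listing.
host = df.loc[:, host_columns].drop_duplicates()
listings = df.loc[:, listings_columns]

# Primary keys must be unique within their own table.
host_pk_unique = bool(host["host_id"].is_unique)
listings_pk_unique = bool(listings["id"].is_unique)

# Every foreign key value in listings must refer to an existing host.
fk_valid = bool(listings["host_id"].isin(host["host_id"]).all())

print(host_pk_unique, listings_pk_unique, fk_valid)
```

Note that drop_duplicates only removes rows that are identical across all the selected columns. If the same host_id appeared with two different host_name values in the original data, both rows would survive the split and the uniqueness check on host_id would fail, which is exactly the situation this check is meant to surface.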
In the following sections we will often refer to the Airbnb dataset to illustrate the operations being discussed. The Airbnb script lays out a full end-to-end example on this dataset, in which both text columns are taken into account.