Neural models

The aindo.rdml.synth module lets you:

  1. Preprocess columns within each table.
  2. Train generative models on relational tabular data.
  3. Generate synthetic data.
  4. Perform predictive tasks on relational tabular data.

To illustrate the full process, from preprocessing to synthetic data generation, the following sections use the Airbnb Open Data dataset. The original dataset consists of a single table, but on closer inspection it can be rearranged into a more "natural" form by splitting it into two tables:

  1. A table host, with primary key host_id.
  2. A table listings, with primary key id and foreign key host_id, referring to the primary key of host.

Both tables have a text column: host_name in host and name in listings. For simplicity, we focus here on the latter, but the same operations can be performed on the host table too.

We can define the Schema and load the data as follows:

import pandas as pd

from aindo.rdml.relational import Schema, Table, Column, PrimaryKey, ForeignKey, RelationalData


# Describe the two tables and the relation between them.
schema = Schema(
    host=Table(
        host_id=PrimaryKey(),
        host_name=Column.TEXT,
        calculated_host_listings_count=Column.NUMERIC,
    ),
    listings=Table(
        id=PrimaryKey(),
        host_id=ForeignKey(parent="host"),
        name=Column.TEXT,
        neighbourhood_group=Column.CATEGORICAL,
        neighbourhood=Column.CATEGORICAL,
        latitude=Column.NUMERIC,
        longitude=Column.NUMERIC,
        room_type=Column.CATEGORICAL,
        price=Column.INTEGER,
        minimum_nights=Column.INTEGER,
        number_of_reviews=Column.INTEGER,
        last_review=Column.DATETIME,
        reviews_per_month=Column.NUMERIC,
        availability_365=Column.INTEGER,
    ),
)

# Load the original single-table dataset and split it according to the schema.
df = pd.read_csv("path/to/airbnb.csv")
dfs = {
    # Host attributes repeat on every listing row, so keep one copy of each distinct row.
    "host": df.loc[:, list(schema.tables["host"].all_columns)].drop_duplicates(),
    "listings": df.loc[:, list(schema.tables["listings"].all_columns)],
}
data = RelationalData(data=dfs, schema=schema)
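
After splitting a single table this way, it can be worth checking that the two tables are consistent with the declared keys: host_id should be unique in host, and every host_id in listings should appear in host. The following is a minimal sketch of such a check using pandas only. The toy data frame stands in for the Airbnb CSV, its values are made up, and check_keys is a hypothetical helper that is not part of aindo.rdml.

```python
import pandas as pd

# Toy stand-in for the single-table Airbnb CSV (values are made up for illustration).
df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "host_id": [10, 10, 20],
        "host_name": ["Ann", "Ann", "Bob"],
        "name": ["Cozy loft", "Sunny room", "Quiet studio"],
    }
)

# Same splitting pattern as above, with the column lists written out by hand.
host = df.loc[:, ["host_id", "host_name"]].drop_duplicates()
listings = df.loc[:, ["id", "host_id", "name"]]


def check_keys(parent: pd.DataFrame, child: pd.DataFrame, key: str) -> dict:
    """Hypothetical helper: report primary-key uniqueness and orphaned foreign keys."""
    return {
        # True if no value of the key is repeated in the parent table.
        "pk_unique": bool(parent[key].is_unique),
        # Number of child rows whose key value is missing from the parent table.
        "orphans": int((~child[key].isin(parent[key])).sum()),
    }


report = check_keys(host, listings, "host_id")
print(report)
```

If pk_unique turns out to be False on the real data, some host attribute differs between rows of the same host, so drop_duplicates() alone does not leave one row per host_id and the host table needs further cleaning before it is used.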

The following sections often refer to the Airbnb dataset to illustrate the operations they describe. The Airbnb script lays out a full end-to-end example on the same dataset, taking both text columns into account.