Quick start

The aindo.rdml library uses pandas.DataFrame objects as inputs and outputs. The data must therefore be loaded as one or more pandas.DataFrame objects, which are then preprocessed before being fed to the generative model training routines. Finally, the trained models generate the synthetic data and output it as pandas.DataFrame objects.

Data Loading

To get started with the library, the data must first be loaded into main memory. In this example, we demonstrate how to do this with the pandas library and a single-table dataset stored in a CSV file. Tabular data is organized into rows and columns, where columns represent attributes and rows represent observations of those attributes. For example, let's examine the first four columns of the UCI Adult single-table dataset.

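import pandas as pd

# Assumes the file has a header row with these column names. The raw UCI
# adult.data file has no header row, so in that case also pass header=None
# and the full list of column names through the names argument.
df = pd.read_csv("path/to/adult.data", usecols=["age", "workclass", "fnlwgt", "education"])
print(df)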

[Out]:
       age          workclass  fnlwgt    education
0       39          State-gov   77516    Bachelors
1       50   Self-emp-not-inc   83311    Bachelors
2       38            Private  215646      HS-grad
3       53            Private  234721         11th
4       28            Private  338409    Bachelors
...    ...                ...     ...          ...
32556   27            Private  257302   Assoc-acdm
32557   40            Private  154374      HS-grad
32558   58            Private  151910      HS-grad
32559   22            Private  201490      HS-grad
32560   52       Self-emp-inc  287927      HS-grad
[32561 rows x 4 columns]

To use the aindo.rdml library, each dataset must be stored in a RelationalData object, which serves as the basic data structure. As the name suggests, this data structure can store both a single table and relational data involving multiple tables. A RelationalData object consists of two main attributes:

A Data object, which is a dictionary with table names as keys and pandas.DataFrame objects as values;
A Schema object, which contains the structure of the relations between tables (e.g. primary and foreign keys) and a description of the column types.

Let us define a RelationalData object for the Adult dataset, with a reduced number of columns, for simplicity. All the needed classes can be found in the aindo.rdml.relational module.

import pandas as pd
from aindo.rdml.relational import Column, Table, Schema, RelationalData

dfs = {"adult": pd.read_csv(...)}
schema = Schema(
    adult=Table(
        age=Column.INTEGER,
        workclass=Column.CATEGORICAL,
        fnlwgt=Column.INTEGER,
        education=Column.TEXT,
    )
)
data = RelationalData(data=dfs, schema=schema)
print(data)

[Out]:
Schema:
adult:Table
Primary key: None
Feature columns:
  age:<Column.INTEGER: 'Integer'>
  workclass:<Column.CATEGORICAL: 'Categorical'>
  fnlwgt:<Column.INTEGER: 'Integer'>
  education:<Column.TEXT: 'Text'>
Foreign keys:

Note that in the above example the categorical column education has been declared as a Column.TEXT just for the sake of showing an example of how a text column is treated in aindo.rdml. More correctly, we should have declared it as a Column.CATEGORICAL.

An example with a more complex, multi-table data structure can be found in the Relational module section.

Train / test data splitting (optional)

The RelationalData class offers a utility function to split the data into train, test and possibly validation sets, while respecting the consistency of the relational data structure.

from aindo.rdml.relational import RelationalData

data = RelationalData(data=..., schema=...)
data_train_valid, data_test = data.split(ratio=0.1)
data_train, data_valid = data_train_valid.split(ratio=0.1)
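
To illustrate what "respecting the consistency of the relational data structure" can mean, here is a minimal sketch in plain Python. It is not the library's implementation. The two tables (hosts and listings) and the helper split_relational are hypothetical, and the sketch assumes that ratio is the fraction of parent rows assigned to the second part, as the variable names in the example above suggest.

```python
import random

# Hypothetical two-table dataset: each listing references exactly one host.
hosts = [{"host_id": i} for i in range(20)]
listings = [{"listing_id": j, "host_id": j % 20} for j in range(60)]


def split_relational(hosts, listings, ratio, seed=0):
    """Split the parent rows at random, then send each child row with its parent."""
    rng = random.Random(seed)
    ids = [h["host_id"] for h in hosts]
    rng.shuffle(ids)
    n_second = int(len(ids) * ratio)  # assumed meaning of ratio: size of the second part
    second_ids = set(ids[:n_second])
    first = (
        [h for h in hosts if h["host_id"] not in second_ids],
        [row for row in listings if row["host_id"] not in second_ids],
    )
    second = (
        [h for h in hosts if h["host_id"] in second_ids],
        [row for row in listings if row["host_id"] in second_ids],
    )
    return first, second


(train_hosts, train_listings), (test_hosts, test_listings) = split_relational(
    hosts, listings, ratio=0.1
)

# Every listing ends up in the same part as the host it references.
train_ids = {h["host_id"] for h in train_hosts}
test_ids = {h["host_id"] for h in test_hosts}
print(all(row["host_id"] in train_ids for row in train_listings))
print(all(row["host_id"] in test_ids for row in test_listings))
```

Under the same assumption about ratio, the two nested calls shown above would leave roughly 81% of the rows for training, 9% for validation and 10% for testing.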

Data preprocessing

Data preprocessing involves transforming data columns before feeding them into the model. Preprocessing is performed through a TabularPreproc object, which can be found in the aindo.rdml.synth module. TabularPreproc objects can be built with the TabularPreproc.from_schema() method, which will build a default preprocessor based on the column types found in the provided Schema. After the instantiation, a TabularPreproc object needs to be fitted on a RelationalData object.

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularPreproc

data = RelationalData(data=..., schema=...)
preproc = TabularPreproc.from_schema(schema=data.schema)
preproc.fit(data=data)

The preprocessing phase may also include additional operations to reduce the risk of privacy leaks, i.e. the risk of revealing personally identifiable information or sensitive data that was present in the original data. While the generative model does not copy individual data records, it could still potentially expose information if it generates data points containing rare categories or outlier numerical values.

To reduce this risk, it is possible to define a custom preprocessing of the columns through the argument preprocessors. This argument is expected to be a dictionary where the keys are the names of the tables, and the values are dictionaries containing ColumnPreproc objects for each column within the respective table. Instances of the ColumnPreproc class allow users to define custom preprocessing operations for individual columns.

For instance, to prevent the model from generating age "35" during data synthesis one would proceed as follows:

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import ColumnPreproc, TabularPreproc

data = RelationalData(data=..., schema=...)
preproc = TabularPreproc.from_schema(
    schema=data.schema,
    preprocessors={"adult": {"age": ColumnPreproc(non_sample_values=[35])}},
)
preproc.fit(data=data)
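
One possible way to choose which values to exclude is to count how often each value occurs and flag the rare ones. The sketch below uses only the Python standard library. The counts are made up for illustration, the helper rare_values and the threshold of 5 are hypothetical, and none of this is part of aindo.rdml.

```python
from collections import Counter

# Hypothetical values of a categorical column (counts invented for illustration).
workclass = (
    ["Private"] * 50
    + ["Self-emp-inc"] * 8
    + ["Never-worked"] * 2
    + ["Without-pay"]
)


def rare_values(values, min_count):
    """Return the distinct values that occur fewer than min_count times, sorted."""
    counts = Counter(values)
    return sorted(v for v, c in counts.items() if c < min_count)


to_exclude = rare_values(workclass, min_count=5)
print(to_exclude)  # ['Never-worked', 'Without-pay']
```

A list obtained this way could then be reviewed and, if appropriate, passed through the non_sample_values argument shown above.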

The preprocessing of text columns is managed by TextPreproc objects, one for each table containing text. Since we have already built a TabularPreproc in our example, we can use it to build the TextPreproc with the TextPreproc.from_tabular() method, also providing the name of the table to consider.

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularPreproc, TextPreproc

data = RelationalData(data=..., schema=...)
preproc = TabularPreproc.from_schema(schema=data.schema)
preproc.fit(data=data)
preproc_text = TextPreproc.from_tabular(preproc=preproc, table="adult")
preproc_text.fit(data=data)

Note that custom preprocessing of text columns is not supported.

Further details on preprocessing functionalities are provided in the Data preprocessing section.

Model training

The aindo.rdml library uses generative models that are trained to infer patterns and distributions of the original data.

The aindo.rdml.synth module offers two generative models for synthetic data generation, each with its own trainer:

A TabularModel, trained by a TabularTrainer, that generates all the relational data excluding columns that contain text.
A TextModel, trained by a TextTrainer, that generates only text columns. Users must define a TextModel for each table containing text columns.

To instantiate and build a TabularModel, the user must provide a TabularPreproc object along with a string indicating the desired model size (small, medium, or large). Larger models generally learn the patterns in the data more accurately, but they may take longer to converge.

To train the model, it is necessary to:

Instantiate a TabularTrainer.
Build a TabularDataset from the training data and the TabularPreproc.

The TabularTrainer.train() method is used to train the model, and it takes as input:

The TabularDataset containing the training data.
The maximum desired number of either training epochs (n_epochs) or training steps (n_steps).
Either the size of each batch of data with the batch_size argument, or alternatively the available memory (on CPU or GPU, depending on the chosen device) through the memory parameter. The latter is used to automatically estimate an optimal batch_size.

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularDataset, TabularModel, TabularPreproc, TabularTrainer

data = RelationalData(data=..., schema=...)
data_train, data_test = data.split(ratio=0.1)
preproc = TabularPreproc.from_schema(schema=data.schema)
preproc.fit(data=data)
model = TabularModel.build(preproc=preproc, size="small")
dataset_train = TabularDataset.from_data(data=data_train, preproc=preproc)
trainer = TabularTrainer(model=model)
trainer.train(
    dataset=dataset_train,
    n_epochs=100,
    batch_size=256,
)
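
To relate n_epochs and n_steps, the following back-of-the-envelope sketch may help. It assumes that one training step processes one batch, that an epoch is one full pass over the training rows, and that about 10% of the 32561 Adult rows were held out. The exact rounding used by the library's split is not documented here, and steps_per_epoch is a hypothetical helper.

```python
import math


def steps_per_epoch(n_rows: int, batch_size: int) -> int:
    """Batches needed to see every row once (the last batch may be partial)."""
    return math.ceil(n_rows / batch_size)


n_rows_total = 32561  # rows in the Adult table printed earlier
n_rows_train = n_rows_total - int(n_rows_total * 0.1)  # assumes 10% held out, rounded down
per_epoch = steps_per_epoch(n_rows_train, batch_size=256)

print(n_rows_train)     # 29305
print(per_epoch)        # 115
print(per_epoch * 100)  # 11500 steps for n_epochs=100
```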

The syntax is similar for TextModel instances, but in this case the user must also specify a block_size, corresponding to the maximum text length that the model can process in a single forward step. A reasonable value for the block_size can be recovered from the TextDataset.max_text_len attribute of the training dataset.

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TextDataset, TextModel, TextPreproc, TextTrainer

data = RelationalData(data=..., schema=...)
data_train, data_test = data.split(ratio=0.1)
preproc_text = TextPreproc.from_schema_table(schema=data.schema, table="adult")
preproc_text.fit(data=data)
dataset_train = TextDataset.from_data(data=data_train, preproc=preproc_text)
model_text = TextModel.build(
    preproc=preproc_text,
    size="small",
    block_size=dataset_train.max_text_len,
)
trainer_text = TextTrainer(model=model_text)
trainer_text.train(
    dataset=dataset_train,
    n_epochs=100,
    batch_size=32,
)

More customization parameters are available via the optional arguments described in the Model training section.

Synthetic data generation

Once the generative model is trained, it can generate synthetic data that closely mirrors the original without containing any personally identifiable information, ensuring both privacy and utility for various applications.

To generate synthetic data using a TabularModel it is enough to call the TabularModel.generate() method, which returns a RelationalData object containing the synthetic data. It is necessary to provide the number of samples to be generated through the n_samples parameter. Optionally, the user can specify:

batch_size, the batch size used during generation. Defaults to 0, which means that all the data is generated in a single batch.
temp, a strictly positive real number describing the amount of noise used in generation. The default value is 1. Larger values introduce more variance; lower values reduce it.
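
This page does not spell out how temp enters the sampling step. A common convention in generative models, assumed here, is to divide the model's scores (logits) by the temperature before applying a softmax. The sketch below uses hypothetical scores for three categories and is not taken from aindo.rdml.

```python
import math


def softmax_with_temperature(logits, temp):
    """Turn scores into probabilities, flattening them as temp grows."""
    if temp <= 0:
        raise ValueError("temp must be strictly positive")
    scaled = [x / temp for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a more spread-out distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.0]  # hypothetical model scores for three categories
for temp in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

Under this assumption, a higher temperature spreads probability mass toward less likely categories, which matches the "more variance" behaviour described above.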

For instance, let's generate the same number of rows as in the original adult table, with a batch size of 1024.

import pandas as pd
from aindo.rdml.synth import TabularModel

df = pd.read_csv(...)
model = TabularModel.build(preproc=..., size=...)

# Train the tabular model
...

data_synth = model.generate(
    n_samples=df.shape[0],
    batch_size=1024,
)

A TabularModel only generates non-text columns. An example of the output of the previous generation is the following:

{'adult':
    age   workclass  fnlwgt
 0   31     Private  108501
 1   39   Local-gov  228490
 2   11     Private  187810
 3   47     Private  113026
 4   26     Private  465070
 ...}

To generate the text column we need to use the TextModel and provide the tabular data that we just generated.

import pandas as pd
from aindo.rdml.synth import TabularModel, TextModel

df = pd.read_csv(...)
model = TabularModel.build(preproc=..., size=...)
model_text = TextModel.build(preproc=..., size=..., block_size=...)

# Train the tabular and text models
...

data_synth = model.generate(
    n_samples=df.shape[0],
    batch_size=1024,
)
data_synth = model_text.generate(
    data=data_synth,
    batch_size=512,
)

The output data_synth is a RelationalData object containing the synthetic version of the original data, including the previously missing text column.

[Out]:
{'adult':
    age   workclass  fnlwgt                                education
 0   31     Private  108501   EntityItem B-grad HS-Flagscollegeachel
 1   39   Local-gov  228490                        achelachel-school
 2   11     Private  187810                                  HS-grad
 3   47     Private  113026                     itu Kara-assycollege
 4   26     Private  465070                                8achelors
 ...}
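
The sample output illustrates why education would be better declared as Column.CATEGORICAL: a text model is free to produce strings that are not valid category labels. As a quick check in plain Python, the sketch below compares the five generated values above with the category labels visible in the Adult excerpt printed earlier (not the full set of labels in the dataset).

```python
# Category labels visible in the Adult excerpt printed earlier (not the full set).
known = {"Bachelors", "HS-grad", "11th", "Assoc-acdm"}

# Generated values copied from the sample output above.
generated = [
    "EntityItem B-grad HS-Flagscollegeachel",
    "achelachel-school",
    "HS-grad",
    "itu Kara-assycollege",
    "8achelors",
]

valid = [g for g in generated if g in known]
print(f"{len(valid)}/{len(generated)} generated values are known categories")
# prints: 1/5 generated values are known categories
```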

The Airbnb script shows a more realistic example of text generation, with relational data and text columns in two different tables.

Evaluation

The aindo.rdml library also includes some tools to evaluate the generated synthetic data. These are found in the aindo.rdml.eval module.

The report() function outputs a PDF displaying the key metrics for the evaluation of the generated synthetic data in terms of both data quality and privacy protection. This function needs training and test data splits, the generated synthetic data and an output path for the PDF file.

The compute_privacy_stats() function performs a more detailed analysis of the privacy metrics. On top of the privacy score, it provides an estimate of its standard deviation, and the estimated fraction of real data points at risk.

from aindo.rdml.eval import report, compute_privacy_stats
from aindo.rdml.relational import RelationalData

data = RelationalData(data=..., schema=...)
data_train, data_test = data.split(ratio=0.1)

# Generate synthetic data
...
data_synth = ...

report(
    data_train=data_train,
    data_test=data_test,
    data_synth=data_synth,
    path="./report.pdf",
)

privacy_stats = compute_privacy_stats(
    data_train=data_train,
    data_synth=data_synth,
)
for t in data.schema.tables:
    print(f"Table: {t}")
    print(f"Privacy score: {privacy_stats[t].privacy_score:.2%} ({privacy_stats[t].privacy_score_std:.3%})")
    print(f"% training points at risk: {privacy_stats[t].risk * 100:.1f}")
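
As a small aside on the output formatting, the sketch below shows how Python's percent and fixed-point format specifiers render fractions like these. The three values are hypothetical and are not results produced by the library.

```python
# Hypothetical values standing in for the fields of a privacy statistics object.
score, score_std, risk = 0.9871, 0.0043, 0.0123

# The "%" presentation type multiplies by 100 and appends a percent sign.
print(f"Privacy score: {score:.2%} ({score_std:.3%})")
# The "f" presentation type prints a plain fixed-point number.
print(f"% training points at risk: {risk * 100:.1f}")
```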