Introduction

The following sections explain in depth how to use the classes and functions in each module.

We begin by demonstrating how users can define the schema of a relational data structure, including column types, tables, and relationships between tables (such as primary and foreign keys), and how to load data according to these defined relational structures. This is achieved through the aindo.rdml.relational module.
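To make the idea of a relational schema concrete, here is a minimal sketch in plain Python. It does not use the aindo.rdml.relational API: the TableSpec class, the check_referential_integrity function, and the example table names and type labels are all hypothetical, chosen only to illustrate column types, primary keys, foreign keys, and loading data that respects them.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TableSpec:
    """Hypothetical table description (not the aindo.rdml API)."""
    columns: Dict[str, str]                      # column name -> type label
    primary_key: Optional[str] = None            # column that identifies a row
    foreign_keys: Dict[str, str] = field(default_factory=dict)  # column -> parent table


def check_referential_integrity(schema, data):
    """Return (table, column, value) triples whose foreign key has no parent row."""
    violations = []
    for name, spec in schema.items():
        for fk_col, parent in spec.foreign_keys.items():
            parent_pk = schema[parent].primary_key
            parent_ids = {row[parent_pk] for row in data[parent]}
            for row in data[name]:
                if row[fk_col] not in parent_ids:
                    violations.append((name, fk_col, row[fk_col]))
    return violations


# Two tables: each row of "seasons" refers to a row of "players".
schema = {
    "players": TableSpec(
        columns={"player_id": "integer", "name": "categorical"},
        primary_key="player_id",
    ),
    "seasons": TableSpec(
        columns={"player_id": "integer", "points": "numeric"},
        foreign_keys={"player_id": "players"},
    ),
}
data = {
    "players": [{"player_id": 1, "name": "A"}, {"player_id": 2, "name": "B"}],
    "seasons": [{"player_id": 1, "points": 10.0}, {"player_id": 3, "points": 4.5}],
}

violations = check_referential_integrity(schema, data)
print(violations)
```

Here the second row of the "seasons" table points to a player that does not exist, so it is reported as a violation. A schema definition is what makes this kind of check possible before any model is trained.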

Next, we focus on generating synthetic data using the neural generative models found in the aindo.rdml.synth module. We first explain in detail how to preprocess the data, then show how to build and train the generative models. Separate models generate the structured portion of the tabular data and the text columns. We also demonstrate how the tabular models can be used for predictive tasks on relational data.

In addition to the neural generative models, the aindo.rdml.synth module also includes a tree-based model using XGBoost, designed for handling the case of a single table.

We then describe how to use the aindo.rdml.synth.event module to generate synthetic time series of events. This module provides a model that can generate entirely new synthetic time series or extend existing ones. The workflow closely resembles the one used for synthetic tabular data: the original data is first preprocessed, then used to train a generative model, which in turn produces synthetic events. The use cases differ, however, so the corresponding sections outline the specific applications of each model and give guidance on how to use them effectively.
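The two generation modes for event series (creating a new series and extending an existing one) can be illustrated with a toy stand-in. The sketch below is not the aindo.rdml.synth.event model: it replaces the neural model with a first-order Markov chain, reduces preprocessing to reading events as plain strings, and uses hypothetical names (fit, generate, START) and made-up example events.

```python
import random
from collections import Counter, defaultdict

START = "<start>"  # artificial marker for the beginning of a series


def fit(series_list):
    """'Train' by counting how often each event follows the previous one."""
    transitions = defaultdict(Counter)
    for series in series_list:
        prev = START
        for event in series:
            transitions[prev][event] += 1
            prev = event
    return transitions


def generate(transitions, length, prefix=(), seed=0):
    """Sample events until the series has `length` items.

    With an empty prefix this creates a new series; with a non-empty
    prefix it extends an existing one.
    """
    rng = random.Random(seed)
    out = list(prefix)
    while len(out) < length:
        prev = out[-1] if out else START
        counts = transitions.get(prev)
        if not counts:  # no observed continuation: stop early
            break
        events, weights = zip(*counts.items())
        out.append(rng.choices(events, weights=weights)[0])
    return out


history = [
    ["login", "view", "buy"],
    ["login", "view", "view", "logout"],
]
model = fit(history)

new_series = generate(model, length=4)                           # from scratch
extended = generate(model, length=5, prefix=["login", "view"])   # continue a series
print(new_series)
print(extended)
```

The same trained object serves both modes; only the starting point of the sampling changes.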

Finally, we introduce the aindo.rdml.synth.llm module, which contains a set of models that leverage the power of Large Language Models (LLMs) to generate synthetic data. These models complement the other neural models available in the aindo.rdml.synth module by enabling the generation of entirely new data. For example, they can be used to:

Generate a dataset from scratch based on a given input structure.
Enrich an existing dataset by adding new columns that are consistent with the existing data.

In the last section, we discuss how to use the functions in the aindo.rdml.eval module to evaluate the quality of the generated data, focusing on both similarity to the original data and privacy protection.