Synthetic data generation
After training the `TabularModel`, generating synthetic data becomes straightforward by using its `TabularModel.generate()` method.
This method takes as input one (and only one) of the following:

- `n_samples`: The number of samples to generate in the root table.
- `ctx`: A dictionary of `pandas.DataFrame`'s including the context from which to start the synthetic data generation. See the section Generation from context for more details.
Optionally, the user can also specify:

- `batch_size`: The number of samples generated in parallel. Defaults to 0, which means that all the data is generated in a single batch.
- `max_block_size`: This parameter limits the length of each generated sample (in terms of its internal representation). It is active only for multi-table datasets. The default value is 0, meaning no limit is enforced. A reasonable value for this parameter can be obtained from the `TabularDataset.block_size` attribute of the dataset.
- `temp`: A strictly positive number describing the amount of noise used in generation. The default value is 1. Larger values will introduce more variance, lower values will decrease it.
```python
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularModel

data = RelationalData(data=..., schema=...)
model_tabular = TabularModel.build(preproc=..., size=...)

# Train the tabular model as shown above
...

data_synth = model_tabular.generate(
    n_samples=data["host"].shape[0],
    batch_size=32,
)
```
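For instance, to obtain a more conservative generation, one could lower the temperature (the value 0.8 below is purely illustrative, not a recommended setting):

```python
data_synth = model_tabular.generate(
    n_samples=data["host"].shape[0],
    batch_size=32,
    temp=0.8,  # less variance than the default temp=1
)
```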
This model generates only non-text columns. The missing text columns are generated by the trained `TextModel`'s, by means of the `TextModel.generate()` method.
The optional arguments are:

- `batch_size`: The batch size used during generation. Defaults to 0, which means that all the data is generated in a single batch.
- `max_text_len`: The maximum length of the generated text for each table row. The default value is 0, meaning the maximum possible value is used, namely the value of the `TextModel.max_block_size` attribute.
- `temp`: As for the tabular model, this parameter controls the amount of noise used in generation, and the default value is 1.
```python
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularModel, TextModel

data = RelationalData(data=..., schema=...)
model_tabular = TabularModel.build(preproc=..., size=...)
model_text = TextModel.build(preproc=..., size=..., block_size=...)

# Train the tabular and text models as shown above
...

data_synth = model_tabular.generate(
    n_samples=data["host"].shape[0],
    batch_size=32,
)
data_synth = model_text.generate(
    data=data_synth,
    batch_size=32,
)
```
At the end of the procedure, `data_synth` is a `RelationalData` object containing the synthetic version of the Airbnb dataset, including the text column `name` in the `listings` table. Note that in order to also generate the `host_name` text column present in the `host` table, we should build and train a second `TextModel` and then generate the `host_name` column in a similar fashion to what was done for the `name` column in the `listings` table. A full example of that can be found in the Airbnb script.
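For illustration, generating the `host_name` column would look roughly as follows. This is a sketch: the `preproc`, `size`, and `block_size` arguments are placeholders, and the second model is assumed to have been built and trained on the `host_name` column.

```python
from aindo.rdml.synth import TextModel

# Second text model, assumed to be configured for the `host_name` column
model_text_host = TextModel.build(preproc=..., size=..., block_size=...)

# Train the model as shown above
...

# Fill in the `host_name` column of the synthetic data
data_synth = model_text_host.generate(
    data=data_synth,
    batch_size=32,
)
```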
Generation from context
In its basic form, the `TabularModel` generates synthetic data from scratch by simply specifying the number of samples to generate using the `n_samples` argument of the `TabularModel.generate()` method. However, it is also possible to conditionally generate data based on a given context, where the user provides part of the data (the context) and the model generates the remaining portion.
To perform conditional generation, the user must:

- Specify the context columns when creating the `TabularPreproc` object, using the `ctx_cols` parameter.
- Provide the context as `pandas.DataFrame`'s to the `ctx` parameter of the `TabularModel.generate()` method.
Two types of context can be provided:

- One or more columns from the root table: In this case, the generation of the root table will start from the provided columns, and the rest of the relational data will be generated as usual. Specifically, the foreign keys of the synthetic data will also be generated, and primary keys for all tables except the root table will be generated as well.
- A (possibly empty) subset of columns from each table: In this scenario, the user must supply the primary keys and foreign keys for each table. The generated synthetic data will retain the keys specified in the context.
Note that foreign keys referring to lookup tables should be treated as feature columns, not as foreign keys. If these are included in the context, they must be specified in the `TabularPreproc` and included in the context passed to the `TabularModel.generate()` method. Otherwise, they should be ignored.
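As a sketch of this case, assuming a hypothetical lookup table `neighbourhood` referenced by the `listings` table through a `neighbourhood_id` column, that column would simply be listed among the context columns like any other feature:

```python
from aindo.rdml.synth import TabularPreproc

# `neighbourhood_id` is a hypothetical foreign key to a lookup table:
# in the context it is treated as an ordinary feature column
preproc = TabularPreproc.from_schema(
    schema=...,
    ctx_cols={"host": [], "listings": ["neighbourhood_id"]},
)
```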
For example, if you want to generate synthetic data where the `"calculated_host_listings_count"` column of the root table `"host"` is identical to the original, you could proceed as follows:
```python
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularModel, TabularPreproc

data = RelationalData(data=..., schema=...)
preproc = TabularPreproc.from_schema(
    schema=data.schema,
    ctx_cols={"host": ["calculated_host_listings_count"]},
)
preproc.fit(data=data)
model = TabularModel.build(preproc=preproc, size=...)

# Train the tabular model
...

ctx = {"host": data["host"].loc[:, ["calculated_host_listings_count"]]}
data_synth = model.generate(
    ctx=ctx,
    batch_size=32,
)
```
Alternatively, if you want to retain certain columns from the original `"listings"` table, you also need to specify the relational structure of the context, including the primary and foreign keys.
```python
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularModel, TabularPreproc

data = RelationalData(data=..., schema=...)
preproc = TabularPreproc.from_schema(
    schema=data.schema,
    ctx_cols={"host": [], "listings": ["room_type", "price"]},
)
preproc.fit(data=data)
model = TabularModel.build(preproc=preproc, size=...)

# Train the tabular model
...

ctx = {
    "host": data["host"].loc[:, ["host_id"]],
    "listings": data["listings"].loc[:, ["host_id", "id", "room_type", "price"]],
}
data_synth = model.generate(
    ctx=ctx,
    batch_size=32,
)
```
Building the context
In general, the user is responsible for providing the data that serves as the context for conditional generation. However, specifying the context can be challenging, especially for complex relational structures. The `TabularPreproc` class offers two utility methods that can assist in building a context starting from some `RelationalData`. The latter may come from holdout data (unused during training) or even from part of the training set.
- `TabularPreproc.select_ctx()`: This method takes a `RelationalData` object as input and optionally accepts a sequence of integer indices for the root table via the `idx` parameter. If the indices are not provided, the context is extracted directly from the provided `RelationalData`, and all primary and foreign keys are retained. However, if a set of indices is provided, it will be used to select the corresponding samples in the root table, along with their child rows. Repeated indices are allowed, and all keys will be reset accordingly.
- `TabularPreproc.sample_ctx()`: This method builds upon `TabularPreproc.select_ctx()`, but instead of using specified indices, the rows are randomly sampled (with replacement). The optional `n_samples` parameter can be used to specify the number of samples to select randomly. By default, `n_samples` is set to the number of samples in the root table of the provided data. Additionally, the optional `rng` parameter allows users to control randomness by providing either a `numpy.random.Generator` or an integer seed.
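Conceptually, the sampling performed on the root table amounts to drawing row indices with replacement and selecting the corresponding rows (together with their children). The stand-alone illustration below uses plain `numpy` and `pandas` on a toy table, not the library API:

```python
import numpy as np
import pandas as pd

# Toy root table standing in for the "host" table
root = pd.DataFrame({"host_id": [1, 2, 3], "listings_count": [10, 20, 30]})

rng = np.random.default_rng(42)  # an integer seed would work as well
idx = rng.integers(low=0, high=len(root), size=5)  # sampling with replacement
ctx_root = root.iloc[idx].reset_index(drop=True)  # 5 rows, possibly repeated
```

In the library, the analogous keys of the selected rows would then be reset so that the sampled context forms a consistent relational structure.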