Synthetic data generation
After training the `TabularModel`, generating synthetic data becomes straightforward by using its `TabularModel.generate()` method.
This method takes as input one (and only one) of the following:

- `n_samples`: The number of samples to generate in the root table.
- `ctx`: A dictionary of `pandas.DataFrame`'s including the context from which to start the synthetic data generation. See the section Generation from context for more details.
Optionally, the user can also specify:

- `batch_size`: The number of samples generated in parallel. Defaults to 0, which means that all the data is generated in a single batch.
- `max_block_size`: This parameter limits the length of each generated sample (in terms of its internal representation). It is active only for multi-table datasets. The default value is 0, meaning no limit is enforced. A reasonable value for this parameter can be obtained from the `TabularDataset.block_size` attribute of the dataset.
- `temp`: A strictly positive number describing the amount of noise used in generation. The default value is 1. Larger values will introduce more variance, lower values will decrease it.
```python
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularModel

data = RelationalData(data=..., schema=...)
model_tabular = TabularModel.build(preproc=..., size=...)

# Train the tabular model as shown above
...

data_synth = model_tabular.generate(
    n_samples=data["host"].shape[0],
    batch_size=32,
)
```
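For instance, to obtain a more conservative generation, one could lower the temperature (the value 0.8 below is purely illustrative, not a recommended setting):

```python
data_synth = model_tabular.generate(
    n_samples=data["host"].shape[0],
    batch_size=32,
    temp=0.8,  # less variance than the default temp=1
)
```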
This model generates only non-text columns. The missing text columns are generated by the trained `TextModel`'s, by means of the `TextModel.generate()` method.
The optional arguments are:

- `batch_size`: The batch size used during generation. Defaults to 0, which means that all the data is generated in a single batch.
- `max_text_len`: The maximum length of the generated text for each table row. The default value is 0, meaning the maximum possible value is used, namely the value of the `TextModel.max_block_size` attribute.
- `temp`: As for the tabular model, this parameter controls the amount of noise used in generation, and the default value is 1.
```python
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularModel, TextModel

data = RelationalData(data=..., schema=...)
model_tabular = TabularModel.build(preproc=..., size=...)
model_text = TextModel.build(preproc=..., size=..., block_size=...)

# Train the tabular and text models as shown above
...

data_synth = model_tabular.generate(
    n_samples=data["host"].shape[0],
    batch_size=32,
)
data_synth = model_text.generate(
    data=data_synth,
    batch_size=32,
)
```
At the end of the procedure, `data_synth` is a `RelationalData` object containing the synthetic version of the Airbnb dataset, including the text column `name` in the `listings` table. Note that in order to also generate the `host_name` text column present in the `host` table, we should build and train a second `TextModel` and then generate the `host_name` column in a similar fashion to what was done for the `name` column in the `listings` table. A full example of that can be found in the Airbnb script.
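For illustration, generating the `host_name` column would look roughly as follows. This is a sketch: the `preproc`, `size`, and `block_size` arguments are placeholders, and the second model is assumed to have been built and trained on the `host_name` column.

```python
from aindo.rdml.synth import TextModel

# Second text model, assumed to be configured for the `host_name` column
model_text_host = TextModel.build(preproc=..., size=..., block_size=...)

# Train the model as shown above
...

# Fill in the `host_name` column of the synthetic data
data_synth = model_text_host.generate(
    data=data_synth,
    batch_size=32,
)
```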
Generation from context
In its basic form, the `TabularModel` generates synthetic data from scratch by simply specifying the number of samples to generate using the `n_samples` argument of the `TabularModel.generate()` method. However, it is also possible to conditionally generate data based on a given context, where the user provides part of the data (the context) and the model generates the remaining portion.
To perform conditional generation, the user must:

- Specify the context columns when creating the `TabularPreproc` object, using the `ctx_cols` parameter.
- Provide the context as `pandas.DataFrame`'s to the `ctx` parameter of the `TabularModel.generate()` method.
Two types of context can be provided:

- One or more columns from the root table: In this case, the generation of the root table will start from the provided columns, and the rest of the relational data will be generated as usual. Specifically, the foreign keys of the synthetic data will also be generated, and primary keys for all tables except the root table will be generated as well.
- A (possibly empty) subset of columns from each table: In this scenario, the user must supply the primary keys and foreign keys for each table. The generated synthetic data will retain the keys specified in the context.
Note that foreign keys referring to lookup tables should be treated as feature columns, not as foreign keys. If these are included in the context, they must be specified in the `TabularPreproc` and included in the context passed to the `TabularModel.generate()` method. Otherwise, they should be ignored.
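As a sketch of this case, assuming a hypothetical lookup table `neighbourhood` referenced by the `listings` table through a `neighbourhood_id` column, that column would simply be listed among the context columns like any other feature:

```python
from aindo.rdml.synth import TabularPreproc

# `neighbourhood_id` is a hypothetical foreign key to a lookup table:
# in the context it is treated as an ordinary feature column
preproc = TabularPreproc.from_schema(
    schema=...,
    ctx_cols={"host": [], "listings": ["neighbourhood_id"]},
)
```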
For example, if you want to generate synthetic data where the `"calculated_host_listings_count"` column of the root table `"host"` is identical to the original, you could proceed as follows:
```python
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularModel, TabularPreproc

data = RelationalData(data=..., schema=...)
preproc = TabularPreproc.from_schema(
    schema=data.schema,
    ctx_cols={"host": ["calculated_host_listings_count"]},
)
preproc.fit(data=data)
model = TabularModel.build(preproc=preproc, size=...)

# Train the tabular model
...

ctx = {"host": data["host"].loc[:, ["calculated_host_listings_count"]]}
data_synth = model.generate(
    ctx=ctx,
    batch_size=32,
)
```
Alternatively, if you want to retain certain columns from the original `"listings"` table, you also need to specify the relational structure of the context, including the primary and foreign keys.
```python
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularModel, TabularPreproc

data = RelationalData(data=..., schema=...)
preproc = TabularPreproc.from_schema(
    schema=data.schema,
    ctx_cols={"host": [], "listings": ["room_type", "price"]},
)
preproc.fit(data=data)
model = TabularModel.build(preproc=preproc, size=...)

# Train the tabular model
...

ctx = {
    "host": data["host"].loc[:, ["host_id"]],
    "listings": data["listings"].loc[:, ["host_id", "id", "room_type", "price"]],
}
data_synth = model.generate(
    ctx=ctx,
    batch_size=32,
)
```
Building the context
In general, the user is responsible for providing the data that serves as the context for conditional generation. However, specifying the context can be challenging, especially for complex relational structures. The `TabularPreproc` class offers two utility methods that can assist in building a context starting from some `RelationalData`. The latter may come from holdout data (unused during training) or even from part of the training set.
- `TabularPreproc.select_ctx()`: This method takes a `RelationalData` object as input and optionally accepts a sequence of integer indices for the root table via the `idx` parameter. If the indices are not provided, the context is extracted directly from the provided `RelationalData`, and all primary and foreign keys are retained. However, if a set of indices is provided, it will be used to select the corresponding samples in the root table, along with their child rows. Repeated indices are allowed, and all keys will be reset accordingly.
- `TabularPreproc.sample_ctx()`: This method builds upon `TabularPreproc.select_ctx()`, but instead of using specified indices, the rows are randomly sampled (with replacement). The optional `n_samples` parameter can be used to specify the number of samples to select randomly. By default, `n_samples` is set to the number of samples in the root table of the provided data. Additionally, the optional `rng` parameter allows users to control randomness by providing either a `numpy.random.Generator` or an integer seed.
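Conceptually, the sampling performed on the root table amounts to drawing row indices with replacement and selecting the corresponding rows (together with their children). The stand-alone illustration below uses plain `numpy` and `pandas` on a toy table, not the library API:

```python
import numpy as np
import pandas as pd

# Toy root table standing in for the "host" table
root = pd.DataFrame({"host_id": [1, 2, 3], "listings_count": [10, 20, 30]})

rng = np.random.default_rng(42)  # an integer seed would work as well
idx = rng.integers(low=0, high=len(root), size=5)  # sampling with replacement
ctx_root = root.iloc[idx].reset_index(drop=True)  # 5 rows, possibly repeated
```

In the library, the analogous keys of the selected rows would then be reset so that the sampled context forms a consistent relational structure.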