Model training

The aindo.rdml library offers two generative models for synthetic data generation:

  • A TabularModel that generates all the relational data excluding columns that contain text.
  • A TextModel that generates only text columns. Users must specify a TextModel for each table containing text columns.

Tabular Model

To instantiate and build a TabularModel, the user calls the TabularModel.build() class method, providing a preproc, which is a TabularPreproc object, and a size, denoting the desired model dimensions. The size argument can be defined in one of the following formats:

  • A TabularModelSize object containing the integer attributes n_layers, h and d;
  • A string or a Size object, internally mapping to a default configuration of TabularModelSize. The options are: "small"/Size.SMALL, "medium"/Size.MEDIUM, or "large"/Size.LARGE.

The user may specify the type of layer used by the model with the block parameter. The available blocks are "free" (the default) and "lstm". Optionally, the user may also provide a dropout value for the dropout layers in the model.
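
For instance, a model with explicit dimensions and an LSTM block could be built roughly as follows (a minimal sketch: the numeric values are illustrative, and the import location of TabularModelSize is an assumption):

from aindo.rdml.synth import TabularModel, TabularModelSize, TabularPreproc

preproc = TabularPreproc.from_schema(schema=...).fit(data=...)

model = TabularModel.build(
    preproc=preproc,
    size=TabularModelSize(n_layers=4, h=8, d=256),  # or simply size="medium"
    block="lstm",  # type of layer, default is "free"
    dropout=0.1,  # optional dropout value
)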

The model is trained using a TabularTrainer object, which is built from the TabularModel. The trainer has an optional parameter dp_budget, which, if provided, must be a DpBudget object containing the (epsilon, delta)-budget for differentially private (DP) training. If not provided, the training will have no differential privacy guarantees. Notice that DP training is available only for single-table datasets.
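
For example, a trainer with a DP budget could be created roughly as follows (a sketch: the DpBudget argument names epsilon and delta are assumed from the (epsilon, delta)-budget described above, and its import location is an assumption):

from aindo.rdml.synth import DpBudget, TabularModel, TabularTrainer

model = TabularModel.build(preproc=..., size="small")

# Trainer without DP guarantees
trainer = TabularTrainer(model=model)

# Trainer with an (epsilon, delta)-budget for DP training (single-table datasets only)
trainer_dp = TabularTrainer(model=model, dp_budget=DpBudget(epsilon=1.0, delta=1e-5))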

To train a model, the user also needs to build a TabularDataset object containing the preprocessed training data. The TabularDataset is built from the raw training data and the same TabularPreproc object used to build the model. There are three options to instantiate a TabularDataset object, illustrated in the sketch after this list:

  • From the raw data, and storing the processed data in RAM. In this case the TabularDataset.from_data() method should be invoked.
  • From the raw data, but storing the processed data on disk. In this case, again the TabularDataset.from_data() method should be invoked, but the on_disk parameter should be set to True. Moreover, the path parameter can be used to provide a directory in which to store the processed data. By default, the data is stored in a temporary directory and deleted at the end of the process. When stored on disk, the data will be loaded one batch at a time during training. This may slightly slow down the training, but will reduce the memory consumption.
  • From data already processed and stored on disk. When using the TabularDataset.from_data() method with on_disk set to True and providing a path, the data is stored in the provided directory and can be accessed again later with the TabularDataset.from_disk() method, providing the TabularPreproc and the path to the directory.
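
A minimal sketch of the three options (the raw data placeholders and directory path are illustrative):

from aindo.rdml.synth import TabularDataset, TabularPreproc

preproc = TabularPreproc.from_schema(schema=...).fit(data=...)

# 1. Process the raw data and keep it in RAM
dataset_ram = TabularDataset.from_data(data=..., preproc=preproc)

# 2. Process the raw data and store it on disk (loaded one batch at a time during training)
dataset_disk = TabularDataset.from_data(data=..., preproc=preproc, on_disk=True, path="proc_data")

# 3. Reload previously processed data from the same directory
dataset_reload = TabularDataset.from_disk(preproc=preproc, path="proc_data")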

The TabularDataset has another optional argument, block_size, which is an integer fixing the maximum length of the internal representation of the input used during training. A smaller block_size will reduce the time of a single training epoch, but will introduce approximations that may compromise the quality of the generated synthetic data. The given block_size should be larger than the maximal length of the internal representation of each single table in the dataset. For this reason, this parameter is available only for multi-table datasets.
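
For a multi-table dataset, the block size could be limited as follows (a sketch; the value is illustrative):

# Multi-table datasets only: preproc is the same TabularPreproc used to build the model
dataset_train = TabularDataset.from_data(data=..., preproc=preproc, block_size=1024)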

Once the trainer and the training dataset are ready, the TabularTrainer.train() method is used to train the model. The method requires:

  • The training dataset (dataset);
  • The desired number of training epochs (n_epochs), or alternatively of training steps (n_steps);
  • Either the batch size (batch_size) or the (CPU or GPU) available memory in MB (memory). The latter is used in turn to compute an optimal batch size, such that:
    • The batch size is a power of two and does not exceed 256, batch_size = 2**n, 0 <= n <= 8;
    • Each epoch consists of at least 50 steps, len(dataset) // batch_size >= 50;
    • Each training step will not require more (CPU or GPU) memory than the available one provided.

Additionally, users can provide the optional arguments:

  • lr: The learning rate, whose optimal value is otherwise automatically determined.
  • valid: A Validation object that configures validation during training. The validation dataset must be provided as a TabularDataset object via the argument dataset, and various functionalities can be activated with the dedicated arguments, including learning rate scheduling and early stopping. To protect the validation data with DP guarantees, a DpValid object should be provided through the dp parameter. For further information, please refer to the API reference.
  • hooks: A sequence of custom training hooks crafted by the user, described in the next section.
  • accumulate_grad: The number of gradient accumulation steps. By default, it is set to 1, meaning the model is updated at each step.
  • dp_step: A DpStep object containing the data needed for the differentially private step. It should be provided if and only if the trainer was equipped with a DP-budget, and therefore only for single-table datasets. For the available settings, please refer to the API reference.
  • world_size: The number of GPUs to use for distributed training. If more than one GPU is available, distributing the training over the available GPUs can speed up the training. If 0 (the default), the training is performed on a single device, the current device of the TabularTrainer object.
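
For instance, letting the library choose the batch size from the available memory, while using gradient accumulation and two GPUs, might look as follows (a sketch, given a trainer and training dataset built as described above; the numeric values are illustrative):

trainer.train(
    dataset=dataset_train,
    n_epochs=50,
    memory=16_000,  # available (CPU or GPU) memory in MB, used to compute the batch size
    accumulate_grad=4,  # update the model every 4 steps
    world_size=2,  # distribute the training over 2 GPUs
)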

Here is an example of training the tabular model, with a validation step at the end of each epoch:

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularDataset, TabularModel, TabularPreproc, TabularTrainer, Validation

data = RelationalData(data=..., schema=...)
data_train, data_valid = data.split(ratio=0.1)

preproc = TabularPreproc.from_schema(schema=data.schema).fit(data=data)

model_tabular = TabularModel.build(preproc=preproc, size="small")

dataset_train = TabularDataset.from_data(data=data_train, preproc=preproc)
dataset_valid = TabularDataset.from_data(data=data_valid, preproc=preproc)

trainer_tabular = TabularTrainer(model=model_tabular)
trainer_tabular.train(
    dataset=dataset_train,
    n_epochs=100,
    batch_size=32,
    valid=Validation(dataset=dataset_valid, each=1, trigger="epoch"),
)

Custom hooks (expert user)

The experienced user might opt to specify personalized training hooks using the hooks parameter of the TabularTrainer.train() method. These hooks must extend the TrainHook class, whose __init__() method takes at least two arguments defining how often the hook is activated: an integer each and a trigger, which may be either "epoch" or "step". A custom hook must implement the _hook(n) method, which is invoked whenever the hook is triggered according to the each and trigger arguments, and receives as an argument the current epoch or step number, depending on the value of trigger.

A custom hook may also override the following methods:

  • setup(trainer, hooks), invoked before the training begins, takes as input the trainer and the previously defined hooks.
  • hook(), called at each training step. The default behavior is to check whether the trigger is activated and, if so, to call the _hook() method.
  • _cleanup(), called at the end of the training. It should return the status of the current hook.
  • cleanup(hook_status), called at the end of the training, which receives as input the status of the previous hooks and should return the status of the current hook. Its default behavior is to check the statuses of the previous hooks and to call the _cleanup() method.
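
For instance, a simple hook that reports progress every few epochs might look as follows (a minimal sketch, reusing the trainer and dataset from the example above; the import location of TrainHook is an assumption):

from aindo.rdml.synth import TrainHook

class ProgressHook(TrainHook):
    def __init__(self, each: int = 5, trigger: str = "epoch") -> None:
        super().__init__(each=each, trigger=trigger)

    def _hook(self, n: int) -> None:
        # n is the current epoch (or step, depending on the trigger)
        print(f"Completed {n} epochs")

# Pass the custom hook to the train() method
trainer_tabular.train(
    dataset=dataset_train,
    n_epochs=100,
    batch_size=32,
    hooks=[ProgressHook(each=10)],
)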

Text Model

As for the TabularModel, to instantiate and build a TextModel instance, the user is required to provide a preproc, which in this case is a TextPreproc, and a size, which is a TextModelSize, a Size, or a string representation of the latter. For a TextModel, the user is also required to provide a block_size, corresponding to the maximum text length that the model can process in a single forward step. Finally, the user may provide the optional dropout parameter.

Alternatively, the user may build a TextModel from a pretrained model, with the constructor TextModel.build_from_pretrained(), providing a TextPreproc and a path to the pretrained model. The optional block_size argument is also available to fix the maximum text length that the model can process during fine-tuning.
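
A sketch of building from a pretrained model (the path is illustrative, and the preproc and path argument names are assumptions):

from aindo.rdml.synth import TextModel

# preproc_text: a fitted TextPreproc for the table containing the text columns
model_text = TextModel.build_from_pretrained(
    preproc=preproc_text,
    path="pretrained/text_model",  # path to the pretrained model
    block_size=512,  # optional: maximum text length during fine-tuning
)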

To build the training (and validation) dataset, the user must instantiate a TextDataset object. The options are similar to those for the TabularDataset; however, in this case the block_size parameter is not available. To reduce the block size, it is possible to set the block_size parameter in TextModel.build(), or the TextModel.max_block_size attribute. A reasonable value for the block size can be obtained from the TextDataset.max_text_len attribute of the training dataset.
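
For example, a reasonable block size could be derived from the training data as follows (a sketch; the raw data placeholders are illustrative):

from aindo.rdml.synth import TextDataset, TextModel, TextPreproc

preproc_text = TextPreproc.from_schema_table(schema=..., table=...).fit(data=...)
dataset_train = TextDataset.from_data(data=..., preproc=preproc_text)

# Use the longest text in the training data to choose the model block size
model_text = TextModel.build(
    preproc=preproc_text,
    size="small",
    block_size=dataset_train.max_text_len,
)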

The associated trainer is a TextTrainer object, which is built from a TextModel. At the moment, DP training is not available for text models, and therefore the dp_budget option is not available. The TextTrainer.train() method has the same arguments as the TabularTrainer.train() method, except for the dp_step option, which is not active. Here is an example of training the text model:

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TextDataset, TextModel, TextPreproc, TextTrainer

data = RelationalData(data=..., schema=...)
data_train, data_valid = data.split(ratio=0.1)

preproc_text = TextPreproc.from_schema_table(schema=data.schema, table="listings").fit(data=data)

model_text = TextModel.build(
    preproc=preproc_text,
    size="small",
    block_size=1024,
)

dataset_train = TextDataset.from_data(data=data_train, preproc=preproc_text)

trainer_text = TextTrainer(model=model_text)
trainer_text.train(
    dataset=dataset_train,
    n_epochs=100,
    batch_size=32,
)