Event generation
To train and generate synthetic event data, the aindo.rdml.synth.event package provides the EventModel class.
Similar to a TabularModel, an EventModel should be built, trained, and then used for data generation.
It can also be saved and later loaded for continued use in another session.
Building the model
An EventModel is instantiated using the EventModel.build() class method.
You must provide:
- An EventPreproc object.
- The model size, which can be a TabularModelSize, a Size, or a string equivalent of the latter.
- Optional arguments: block (either "free" [default] or "lstm") and dropout. These arguments mirror those in TabularModel.build().
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth.event import EventModel, EventPreproc

data = RelationalData(data=..., schema=...)
preproc = EventPreproc.from_schema(
    schema=data.schema,
    ord_cols={"visits": "date", "diagnosis": "date", ...},  # columns that order the events in each table
    final_tables=["discharge"],  # tables containing the final events of a sequence
).fit(data=data)
model = EventModel.build(preproc=preproc, size="small")
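The optional arguments can be passed alongside the preprocessor and size. A minimal sketch, with an illustrative dropout value (not a documented default):

model = EventModel.build(
    preproc=preproc,
    size="small",
    block="lstm",  # instead of the default "free"
    dropout=0.1,  # illustrative value
)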
Training the model
To train an EventModel, use the EventTrainer class.
Training requires an EventDataset, which is created from the training data and the same preprocessor used for building the model.
Use the EventDataset.from_data() class method to process the data and create the dataset.
By default, the data is stored in memory.
Alternatively, to save it to disk, set on_disk=True.
If a path is provided to the optional path parameter, it is used as the location where the processed data is stored.
If no path is provided, the data is saved in a temporary directory.
The processed data stored on disk can be loaded in a different session by instantiating an EventDataset with the EventDataset.from_disk() class method.
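A minimal sketch of the on-disk workflow described above. The path argument to EventDataset.from_disk() is an assumption, chosen to mirror the path parameter of EventDataset.from_data():

from aindo.rdml.synth.event import EventDataset

# Process the data once and store it on disk
dataset_train = EventDataset.from_data(
    data=data_train,
    preproc=preproc,
    on_disk=True,
    path="path/to/dataset",  # omit to use a temporary directory
)

# In a later session, load the processed data from disk
dataset_train = EventDataset.from_disk(path="path/to/dataset")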
With an EventTrainer and an EventDataset ready, you can call EventTrainer.train() to train the model.
Training behaves just like in the tabular case, with support for:
- Dynamic batch sizing based on the provided available memory.
- Validation via a Validation object.
- Differential privacy.
- Custom training hooks.
- Multi-GPU training.
Refer to the tabular training section of this guide for full details.
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import Validation
from aindo.rdml.synth.event import EventDataset, EventModel, EventPreproc, EventTrainer

data = RelationalData(data=..., schema=...)
data_train, data_valid = data.split(ratio=0.1)
preproc = EventPreproc.from_schema(
    schema=data.schema,
    ord_cols={"visits": "date", "diagnosis": "date", ...},
    final_tables=["discharge"],
).fit(data=data)
model = EventModel.build(preproc=preproc, size="small")
dataset_train = EventDataset.from_data(data=data_train, preproc=preproc)
dataset_valid = EventDataset.from_data(data=data_valid, preproc=preproc)
trainer = EventTrainer(model=model)
trainer.train(
    dataset=dataset_train,
    n_epochs=100,
    batch_size=32,
    valid=Validation(dataset=dataset_valid, each=1, trigger="epoch"),
)
Saving and loading
Models and trainers can be saved using the EventModel.save() and EventTrainer.save() methods.
from aindo.rdml.synth.event import EventModel
model = EventModel.build(preproc=..., size=...)
# Train the event model
...
model.save(path="path/to/ckpt")
A saved model or trainer can be later loaded with the EventModel.load() and EventTrainer.load() methods.
from aindo.rdml.synth.event import EventModel

model = EventModel.load(path="path/to/ckpt")
data_synth = model.generate(
    n_samples=1_000,
    batch_size=256,
)
Tip
Similar to the tabular case, save the trainer only if you plan to resume training.
Tip
The saved EventTrainer contains the model (EventTrainer.model attribute), so saving both is redundant.
For more details, please see the saving and loading section of the tabular case.
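For completeness, a minimal sketch of the trainer round trip, mirroring the model example above (the path argument is assumed to work as in EventModel.save()):

from aindo.rdml.synth.event import EventTrainer

trainer.save(path="path/to/trainer_ckpt")

# In a later session: the loaded trainer carries the model
trainer = EventTrainer.load(path="path/to/trainer_ckpt")
model = trainer.model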
Generation of event data
A trained EventModel can be used to generate synthetic event data using the EventModel.generate() method.
The output is a RelationalData object containing the generated data.
Generation modes
To generate new event series from scratch, specify the number of samples using the n_samples
parameter.
In this case, the model generates n_samples
synthetic event sequences, including rows for the root table.
Alternatively, you can guide the generation process by providing a context through the ctx
parameter.
The model will then perform conditional generation, starting from the provided context.
The context can be of two types:
- Partial Root Table: The context may include some or all columns from the root table. In this case, the model will complete the remaining columns of the root table and generate the associated event sequence. These context columns must be declared in the EventPreproc. As an example, to generate time series only for male patients, use as context column the gender column from the patient table, and provide as context a constant column with the value "male" (see the sketch after this list).
- Partial Time Series: The context may include the full root table and partial sequences of events in one or more event tables. In this case, the model will continue generating events from where the sequences left off.
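Below is a hypothetical sketch of the male-patients example. That ctx accepts the root-table context columns as a pandas DataFrame is an assumption made for illustration; it also presumes the gender column was declared as a context column in the EventPreproc:

import pandas as pd

# Assumption: ctx accepts the declared context columns of the root table
# as a DataFrame, one row per sequence to generate
ctx = pd.DataFrame({"gender": ["male"] * 1_000})
data_synth = model.generate(
    ctx=ctx,
    batch_size=256,
)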
Stopping criteria
By default, the model learns when to stop generating an event sequence based on the training data. If a final event is generated, the sequence terminates automatically.
You can override this behavior with the following optional parameters of EventModel.generate():
- max_n_events: The maximum number of events in the final sequence (including those in the context, if provided).
- forbidden_events: A collection of event tables that should not be generated. This includes both final and non-final events. For example, consider two final tables, discharge and ongoing, with the latter containing the final events for the patients with ongoing treatments at the time of the data collection. If you only want to generate discharges, simply add ongoing to this list (see the sketch after this list).
- force_final (bool): If True, forces the sequence to end with a final event, unless max_n_events is reached first. This is useful when some sequences in the original data do not end in a final event, while the user needs to generate complete sequences.
- force_event (bool): If True, the model continues generating non-final events until the max_n_events limit is reached.
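A short sketch of the forbidden_events example above, assuming event tables are referenced by name and a model trained as in the previous sections:

# "ongoing" final events are never generated, so every sequence
# can only terminate with a "discharge" final event
data_synth = model.generate(
    n_samples=500,
    forbidden_events=["ongoing"],
    batch_size=256,
)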
Warning
A positive value for max_n_events is required when force_event is set to True.
Warning
force_final and force_event cannot both be set to True.
Additional parameters
There are two additional optional parameters, analogous to those in TabularModel.generate():
- batch_size: The number of samples generated in parallel. Defaults to 0, which means all samples are generated in one batch.
- temp: A strictly positive number (default 1) controlling the randomness of generation. Higher values increase variability, while lower values make the generation more deterministic.
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth.event import EventModel

data = RelationalData(data=..., schema=...)
model = EventModel.build(preproc=..., size=...)
# Train the model
...
# Generate synthetic events from scratch
data_synth = model.generate(
    n_samples=200,
    batch_size=512,
)
# Continue the time series in the test set,
# until a final event is reached, or the series reaches 200 events
data_synth_continue = model.generate(
    ctx=data,
    force_final=True,
    max_n_events=200,
    batch_size=512,
)
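The temp parameter is not used above; a brief sketch of lowering it for more conservative sampling (the value 0.7 is illustrative):

# Lower temperature: less random, more deterministic generation
data_synth_cold = model.generate(
    n_samples=200,
    temp=0.7,
    batch_size=512,
)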
Example: "what-if" scenarios
One powerful use case is the generation of "what-if" scenarios.
Imagine you have the medical records of a single patient up to a certain point in time, represented as an (incomplete) event series. You may want to explore the likelihood of various future outcomes if the sequence continues, and how such likelihood may change according to the possible next actions taken.
To do this, you can gather N identical copies of the patient’s timeline in a single dataset (with the patient table containing N identical rows) and perform a single generation using the partial sequences as context, setting the desired options, e.g., force_final=True if you're specifically interested in the discharge information.
The model will produce N plausible continuations of the patient’s medical history, each representing a different possible future. These can then be analyzed statistically to understand the range and distribution of potential outcomes.
Now, consider a second scenario where the event series is identical except for a single modification to the last event, perhaps a different diagnosis or procedure. This gives you two versions of the patient's timeline, each diverging at one key point.
By running the same multi-sample generation on both versions, you can compare the outcomes and assess the statistical impact of that one differing event on the patient’s future medical history.
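A hypothetical sketch of the multi-sample setup, under two assumptions not stated in this guide: that RelationalData accepts a mapping of table names to pandas DataFrames, and that the replicated rows can share the original schema (primary keys may need to be made distinct, depending on the schema):

import pandas as pd

from aindo.rdml.relational import RelationalData

N = 1_000  # number of plausible futures to sample

patient_df = ...  # single-row DataFrame: the patient's root-table record
visits_df = ...  # the partial event sequence observed so far

# Replicate the timeline N times to sample N continuations at once
ctx = RelationalData(
    data={
        "patient": pd.concat([patient_df] * N, ignore_index=True),
        "visits": pd.concat([visits_df] * N, ignore_index=True),
    },
    schema=data.schema,
)

# N plausible continuations, each forced to end with a final event
data_synth = model.generate(
    ctx=ctx,
    force_final=True,
    max_n_events=200,
    batch_size=512,
)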