Event generation
To train and generate synthetic event data, the aindo.rdml.synth.event package provides the EventModel class.
Similar to a TabularModel, an EventModel should be built, trained, and then used for data generation.
It can also be saved and later loaded for continued use in another session.
Building the model
An EventModel is instantiated using the EventModel.build() class method.
You must provide:
- An EventPreproc object.
- The model size, which can be a TabularModelSize, a Size, or a string equivalent of the latter.
- Optional arguments: block (either "free" [default] or "lstm") and dropout. These arguments mirror those in TabularModel.build().
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth.event import EventModel, EventPreproc

data = RelationalData(data=..., schema=...)
preproc = EventPreproc.from_schema(
    schema=data.schema,
    ord_cols={"visits": "date", "diagnosis": "date", ...},  # columns that order the events in each table
    final_tables=["discharge"],  # tables containing the final events of a sequence
).fit(data=data)
model = EventModel.build(preproc=preproc, size="small")
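The optional arguments can be passed alongside the preprocessor and size. A minimal sketch, with an illustrative dropout value (not a documented default):

model = EventModel.build(
    preproc=preproc,
    size="small",
    block="lstm",  # instead of the default "free"
    dropout=0.1,  # illustrative value
)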
Training the model
To train an EventModel, use the EventTrainer class.
Training requires an EventDataset, which is created from the training data and the same preprocessor used for building the model.
Use the EventDataset.from_data() class method to process the data and create the dataset.
By default, the data is stored in memory.
Alternatively, to save it to disk, set on_disk=True.
If a path is provided to the optional path parameter, it is used as the location where the processed data is stored.
If no path is provided, the data is saved in a temporary directory.
The processed data stored on disk can be loaded in a different session by instantiating an EventDataset with the EventDataset.from_disk() class method.
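A minimal sketch of the on-disk workflow described above. The path argument to EventDataset.from_disk() is an assumption, chosen to mirror the path parameter of EventDataset.from_data():

from aindo.rdml.synth.event import EventDataset

# Process the data once and store it on disk
dataset_train = EventDataset.from_data(
    data=data_train,
    preproc=preproc,
    on_disk=True,
    path="path/to/dataset",  # omit to use a temporary directory
)

# In a later session, load the processed data from disk
dataset_train = EventDataset.from_disk(path="path/to/dataset")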
With an EventTrainer and an EventDataset ready, you can call EventTrainer.train() to train the model.
Training behaves just like in the tabular case, with support for:
- Dynamic batch sizing based on the provided available memory.
- Validation via a Validation object.
- Differential privacy.
- Custom training hooks.
- Multi-GPU training.
Refer to the tabular training section of this guide for full details.
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import Validation
from aindo.rdml.synth.event import EventDataset, EventModel, EventPreproc, EventTrainer

data = RelationalData(data=..., schema=...)
data_train, data_valid = data.split(ratio=0.1)
preproc = EventPreproc.from_schema(
    schema=data.schema,
    ord_cols={"visits": "date", "diagnosis": "date", ...},
    final_tables=["discharge"],
).fit(data=data)
model = EventModel.build(preproc=preproc, size="small")
dataset_train = EventDataset.from_data(data=data_train, preproc=preproc)
dataset_valid = EventDataset.from_data(data=data_valid, preproc=preproc)
trainer = EventTrainer(model=model)
trainer.train(
    dataset=dataset_train,
    n_epochs=100,
    batch_size=32,
    valid=Validation(dataset=dataset_valid, each=1, trigger="epoch"),
)
Saving and loading
Models and trainers can be saved using the EventModel.save() and EventTrainer.save() methods.
from aindo.rdml.synth.event import EventModel
model = EventModel.build(preproc=..., size=...)
# Train the event model
...
model.save(path="path/to/ckpt")
A saved model or trainer can be later loaded with the EventModel.load() and EventTrainer.load() methods.
from aindo.rdml.synth.event import EventModel

model = EventModel.load(path="path/to/ckpt")
data_synth = model.generate(
    n_samples=1_000,
    batch_size=256,
)
Tip
Similar to the tabular case, save the trainer only if you plan to resume training.
Tip
The saved EventTrainer contains the model (EventTrainer.model attribute), so saving both is redundant.
For more details, please see the saving and loading section of the tabular case.
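For completeness, a minimal sketch of the trainer round trip, mirroring the model example above (the path argument is assumed to work as in EventModel.save()):

from aindo.rdml.synth.event import EventTrainer

trainer.save(path="path/to/trainer_ckpt")

# In a later session: the loaded trainer carries the model
trainer = EventTrainer.load(path="path/to/trainer_ckpt")
model = trainer.model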
Generation of event data
A trained EventModel can be used to generate synthetic event data using the EventModel.generate() method.
The output is a RelationalData object containing the generated data.
Generation modes
To generate new event series from scratch, specify the number of samples using the n_samples
parameter.
In this case, the model generates n_samples
synthetic event sequences, including rows for the root table.
Alternatively, you can guide the generation process by providing a context through the ctx
parameter.
The model will then perform conditional generation, starting from the provided context.
The context can be of two types:
- Partial Root Table: The context may include some or all columns from the root table. In this case, the model will complete the remaining columns of the root table and generate the associated event sequence. These context columns must be declared in the EventPreproc. As an example, to generate time series only for male patients, use as context column the gender column from the patient table, and provide as context a constant column with the value "male" (see the sketch after this list).
- Partial Time Series: The context may include the full root table and partial sequences of events in one or more event tables. In this case, the model will continue generating events from where the sequences left off.
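Below is a hypothetical sketch of the male-patients example. That ctx accepts the root-table context columns as a pandas DataFrame is an assumption made for illustration; it also presumes the gender column was declared as a context column in the EventPreproc:

import pandas as pd

# Assumption: ctx accepts the declared context columns of the root table
# as a DataFrame, one row per sequence to generate
ctx = pd.DataFrame({"gender": ["male"] * 1_000})
data_synth = model.generate(
    ctx=ctx,
    batch_size=256,
)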
Stopping criteria
By default, the model learns when to stop generating an event sequence based on the training data. If a final event is generated, the sequence terminates automatically.
You can override this behavior with the following optional parameters of EventModel.generate():
- max_n_events: The maximum number of events in the final sequence (including those in the context, if provided).
- forbidden_events: A collection of event tables that should not be generated. This includes both final and non-final events. For example, consider two final tables, discharge and ongoing, with the latter containing the final events for the patients with ongoing treatments at the time of the data collection. If you only want to generate discharges, simply add ongoing to this list (see the sketch after this list).
- force_final (bool): If True, forces the sequence to end with a final event, unless max_n_events is reached first. This is useful when some sequences in the original data do not end in a final event, while the user needs to generate complete sequences.
- force_event (bool): If True, the model continues generating non-final events until the max_n_events limit is reached.
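A short sketch of the forbidden_events example above, assuming event tables are referenced by name and a model trained as in the previous sections:

# "ongoing" final events are never generated, so every sequence
# can only terminate with a "discharge" final event
data_synth = model.generate(
    n_samples=500,
    forbidden_events=["ongoing"],
    batch_size=256,
)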
Warning
A positive value for max_n_events is required when force_event is set to True.
Warning
force_final and force_event cannot both be set to True.
Additional parameters
There are two additional optional parameters, analogous to those in TabularModel.generate():
- batch_size: The number of samples generated in parallel. Defaults to 0, which means all samples are generated in one batch.
- temp: A strictly positive number (default 1) controlling the randomness of generation. Higher values increase variability, while lower values make the generation more deterministic.
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth.event import EventModel

data = RelationalData(data=..., schema=...)
model = EventModel.build(preproc=..., size=...)
# Train the model
...
# Generate synthetic events from scratch
data_synth = model.generate(
    n_samples=200,
    batch_size=512,
)
# Continue the time series in the test set,
# until a final event is reached, or the series reaches 200 events
data_synth_continue = model.generate(
    ctx=data,
    force_final=True,
    max_n_events=200,
    batch_size=512,
)
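The temp parameter is not used above; a brief sketch of lowering it for more conservative sampling (the value 0.7 is illustrative):

# Lower temperature: less random, more deterministic generation
data_synth_cold = model.generate(
    n_samples=200,
    temp=0.7,
    batch_size=512,
)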
Example: "what-if" scenarios
One powerful use case is the generation of "what-if" scenarios.
Imagine you have the medical records of a single patient up to a certain point in time, represented as an (incomplete) event series. You may want to explore the likelihood of various future outcomes if the sequence continues, and how such likelihood may change according to the possible next actions taken.
To do this, you can gather N identical copies of the patient’s timeline in a single dataset (with the patient table containing N identical rows) and perform a single generation using the partial sequences as context, setting the desired options, e.g., force_final=True if you're specifically interested in the discharge information.
The model will produce N plausible continuations of the patient’s medical history, each representing a different possible future. These can then be analyzed statistically to understand the range and distribution of potential outcomes.
Now, consider a second scenario where the event series is identical except for a single modification to the last event, perhaps a different diagnosis or procedure. This gives you two versions of the patient's timeline, each diverging at one key point.
By running the same multi-sample generation on both versions, you can compare the outcomes and assess the statistical impact of that one differing event on the patient’s future medical history.
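A hypothetical sketch of the multi-sample setup, under two assumptions not stated in this guide: that RelationalData accepts a mapping of table names to pandas DataFrames, and that the replicated rows can share the original schema (primary keys may need to be made distinct, depending on the schema):

import pandas as pd

from aindo.rdml.relational import RelationalData

N = 1_000  # number of plausible futures to sample

patient_df = ...  # single-row DataFrame: the patient's root-table record
visits_df = ...  # the partial event sequence observed so far

# Replicate the timeline N times to sample N continuations at once
ctx = RelationalData(
    data={
        "patient": pd.concat([patient_df] * N, ignore_index=True),
        "visits": pd.concat([visits_df] * N, ignore_index=True),
    },
    schema=data.schema,
)

# N plausible continuations, each forced to end with a final event
data_synth = model.generate(
    ctx=ctx,
    force_final=True,
    max_n_events=200,
    batch_size=512,
)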