Event model
The `aindo.rdml.synth.event` module enables the generation of synthetic event data.
Event data are a collection of time series composed of various types of events.
To illustrate, consider the example of hospitalizations for a collection of inpatients at a particular hospital. Each patient's medical history includes a sequence of events (such as visits, diagnoses, and procedures), often ending with a discharge event, though some histories may still be ongoing at the time of data collection. These events are chronologically ordered and are typically associated with specific dates and times.
Event data can be structured in a relational format consisting of:
- A root table, which contains general information about each individual time series. In our example, this would be the `patient` table, storing personal information about the inpatients.
- One or more event tables, each capturing a specific type of event related to individuals in the root table. For patient medical histories, typical event tables might include:
  - `visits`: Records of each patient's medical visits.
  - `diagnosis`: Diagnoses recorded for each patient.
  - `procedures`: Medical procedures performed on each patient.
  - ... and so on.
- Certain event tables are designated as final events, meaning each individual can have at most one such event, and it must occur last in their time series. In our hospitalization example, the `discharge` table represents such a final event.
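The structure above can be sketched in plain Python. This is only an illustration of the data layout and of the final-event constraint, not the `aindo.rdml` API: the `Event` class, the sample rows, and the helper names `build_series` and `is_valid_series` are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    patient_id: int  # reference to a row of the root table
    table: str       # name of the event table this row belongs to
    when: date       # timestamp used for chronological ordering


# Hypothetical root table: one entry per patient.
patients = {1: {"age": 54}, 2: {"age": 37}}

# Hypothetical rows coming from several event tables.
events = [
    Event(1, "visits", date(2024, 1, 3)),
    Event(1, "procedures", date(2024, 1, 4)),
    Event(1, "discharge", date(2024, 1, 6)),
    Event(2, "visits", date(2024, 2, 10)),
    Event(2, "diagnosis", date(2024, 2, 11)),
]

# Event tables designated as final events.
FINAL_TABLES = {"discharge"}


def build_series(events):
    """Group events by patient, ordered chronologically."""
    series = {}
    for ev in sorted(events, key=lambda e: e.when):
        series.setdefault(ev.patient_id, []).append(ev)
    return series


def is_valid_series(series, final_tables=FINAL_TABLES):
    """A series has at most one final event, and it must come last."""
    finals = [i for i, ev in enumerate(series) if ev.table in final_tables]
    return not finals or finals == [len(series) - 1]


series = build_series(events)
# A series ending with a final event is complete; otherwise it is ongoing.
status = {
    pid: "complete" if s[-1].table in FINAL_TABLES else "ongoing"
    for pid, s in series.items()
}
```

Here patient 1 has a complete history, while patient 2 is still ongoing because no discharge event has been recorded.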
The goal of the `aindo.rdml.synth.event` module is to train a model on this type of event data
and use it to generate realistic synthetic event series.
While the underlying model is similar to that used for generating synthetic relational data,
it includes additional capabilities specifically designed for time-ordered event generation.
Consider the following key use cases:
- Full generation: Create N entirely new time series that replicate the global statistical patterns of the original dataset. This is akin to standard relational synthetic data generation, and is useful for generating fully synthetic patient histories.
- Continuation: Extend partial time series for a set of individuals in a statistically consistent manner. This is useful for modeling the future progression of, for instance, an ongoing hospitalization.
- Controlled termination: When generating or extending time series, users can specify a stopping condition, such as requiring that the sequence ends with a final event (e.g., a discharge event). This ensures all generated medical histories are complete, even if the original data include ongoing cases.
- Length control: Alternatively, the user can instruct the model to keep generating events up to a specified number of events.
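To make the four use cases concrete, here is a toy sketch in which a uniform random sampler stands in for a trained model. It is not the `aindo.rdml.synth.event` API: the function `generate`, its `prefix` and `max_new_events` parameters, and the event names are assumptions made purely for illustration.

```python
import random

EVENT_TYPES = ["visits", "diagnosis", "procedures", "discharge"]
FINAL_EVENT = "discharge"


def generate(prefix=(), *, max_new_events=None, seed=0):
    """Toy stand-in for a trained event model (hypothetical, not the real API).

    prefix: events already observed (empty for full generation).
    max_new_events=None: controlled termination, sample until a final event.
    max_new_events=k: length control, add at most k events.
    """
    rng = random.Random(seed)
    series = list(prefix)
    added = 0
    # A final event always closes the series, whatever the stopping mode.
    while not (series and series[-1] == FINAL_EVENT):
        if max_new_events is not None and added >= max_new_events:
            break
        series.append(rng.choice(EVENT_TYPES))
        added += 1
    return series


# Full generation: start from nothing and stop at the final event.
full = generate(seed=1)

# Continuation with controlled termination: extend an ongoing history.
continued = generate(["visits", "diagnosis"], seed=2)

# Continuation with length control: add at most three more events.
capped = generate(["visits"], max_new_events=3, seed=3)
```

In this sketch the two stopping conditions are alternatives: without a length limit every series ends with the final event, while with a limit a series may remain ongoing.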
In the Event script, we provide a full example of how to use
the `aindo.rdml.synth.event` module to train a model and generate either new time series
or continuations of existing ones.
For demonstration, we use the BasketballMen dataset, which includes:
- A root table, `players`, containing information about NBA players.
- Two event tables:
  - `season`: Details about the seasons played by each player.
  - `all_star`: Information about the all-star seasons of each player.
Events in both event tables are ordered chronologically based on a column containing the season year.
By comparing the event script with the standard relational script, users can better understand both the similarities and differences between the two approaches. Although the workflows are similar (define a preprocessor, build a model, train it, and generate data), the core distinction lies in the use case:
- In the tabular case, the goal is to generate a new dataset with entirely new basketball players. Optionally, some columns from the original dataset may serve as the starting point of the generation. Here, the context is "vertical": the user must provide the full context columns, across all rows of each table, spanning it "vertically".
- In the event case, the model may either generate a new dataset or extend an existing one by continuing each individual's time series. In this setting, the context is "horizontal": the user supplies the first N events for each individual, and these are full, "horizontal" rows, with all columns filled in. Figure 1 illustrates the difference between (a) "vertical" and (b) "horizontal" context. Additionally, with the event generation model, the user has explicit control over the stopping criteria.
Figure 1. Generation from (a) a "vertical" context in the tabular case, and from (b) a "horizontal" context in the event case.
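The two kinds of context can also be shown on a small toy table. The sketch below uses made-up rows and hypothetical helper names (`vertical_context`, `horizontal_context`); it only shows which cells are given to the model in each case and does not use the library itself.

```python
# Hypothetical season table: one row per (player, season), ordered by year.
columns = ["player_id", "year", "team", "points"]
rows = [
    (1, 2001, "team_a", 410),
    (1, 2002, "team_a", 655),
    (1, 2003, "team_b", 702),
    (2, 2001, "team_c", 120),
    (2, 2002, "team_c", 310),
]


def vertical_context(rows, columns, keep):
    """Keep the chosen columns for ALL rows; None marks cells to generate."""
    return [
        tuple(value if col in keep else None for col, value in zip(columns, row))
        for row in rows
    ]


def horizontal_context(rows, n_first):
    """Keep the first n_first FULL rows of each player; later rows are generated."""
    seen = {}
    context = []
    for row in rows:  # rows are assumed chronologically ordered per player
        player_id = row[0]
        if seen.get(player_id, 0) < n_first:
            context.append(row)
            seen[player_id] = seen.get(player_id, 0) + 1
    return context


# (a) Vertical: player_id and year are known everywhere, the rest is generated.
vertical = vertical_context(rows, columns, keep={"player_id", "year"})

# (b) Horizontal: the first season of each player is fully known.
horizontal = horizontal_context(rows, n_first=1)
```

In case (a) every row is partially filled, whereas in case (b) each player contributes a few complete rows and the model continues the series from there.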