Data preprocessing
As discussed in the introduction, event data must adhere to a specific structure:
- Root Table: A table with a primary key and feature columns containing global information for each time series. In our example, the patient table holds the personal data of inpatients.
- Event Tables: One or more tables with a single foreign key referencing the root table. In our case, the visits, diagnosis, procedures, and discharge tables all reference the patient table via a foreign key.
- Final Event Tables (optional): Some event tables may be declared as final. A final table contains the last event for each time series, meaning only one row per foreign key value is allowed. For example, the discharge table is a final table, as each patient can only be discharged once, and no further medical events should occur afterward.
- Order Columns: Each event table may have an order column that defines the sequence of events in the time series. Order columns can be either numerical or datetime fields and are used to order the events in time. Final tables do not require an order column, since their events are always considered the last in the series. By contrast, non-final tables should have an order column to define the correct event sequence. The exception is when there is only one non-final event table: in this case, if no order column is provided, the original row order is used. In our example, all event tables include a date column, which serves as the order column.
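The structural rules above can be checked with plain pandas before a schema is built. The following is a minimal sketch and not part of the aindo.rdml API: the helper name check_event_structure and the toy tables are assumptions made for illustration.

```python
import pandas as pd


def check_event_structure(root, events, key, final_tables=()):
    """Return a list of violations of the event-data structure (hypothetical helper)."""
    problems = []
    # The root table's primary key must be unique and non-null.
    if root[key].isna().any() or root[key].duplicated().any():
        problems.append("root: primary key has nulls or duplicates")
    root_keys = set(root[key])
    for name, table in events.items():
        # Every event row must reference an existing row of the root table.
        if not set(table[key]).issubset(root_keys):
            problems.append(f"{name}: foreign key values missing from root table")
        # A final table allows at most one row per foreign key value.
        if name in final_tables and table[key].duplicated().any():
            problems.append(f"{name}: more than one final event per key")
    return problems


# Toy data, invented for illustration only.
patient = pd.DataFrame({"patient_id": [1, 2]})
visits = pd.DataFrame(
    {"patient_id": [1, 1, 2], "date": ["2024-01-01", "2024-01-03", "2024-02-01"]}
)
discharge = pd.DataFrame({"patient_id": [1, 2], "date": ["2024-01-05", "2024-02-02"]})

problems = check_event_structure(
    patient,
    {"visits": visits, "discharge": discharge},
    key="patient_id",
    final_tables=["discharge"],
)
print(problems)
```

An empty list means the toy tables respect the root, event, and final table rules. A discharge table with two rows for the same patient would be reported as a violation.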
Warning
All order columns in the dataset must be mutually consistent. For example, if they are dates, they should all use the same date format.
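One way to meet this requirement is to parse every order column into a pandas datetime with an explicit format before the data is handed to the library. The sketch below uses only pandas. The source formats and the toy tables are assumptions for illustration.

```python
import pandas as pd

# Toy tables whose date columns arrive in different string formats (invented data).
visits = pd.DataFrame({"patient_id": [1, 1], "date": ["2024-01-03", "2024-01-01"]})
discharge = pd.DataFrame({"patient_id": [1], "date": ["05/01/2024"]})  # day/month/year

# Parse each order column with its own explicit format so that no value is
# silently misread (e.g. day and month swapped).
visits["date"] = pd.to_datetime(visits["date"], format="%Y-%m-%d")
discharge["date"] = pd.to_datetime(discharge["date"], format="%d/%m/%Y")

tables = {"visits": visits, "discharge": discharge}

# Once parsed, events from different tables can be compared on a single timeline.
timeline = (
    pd.concat([t.assign(table=name) for name, t in tables.items()], ignore_index=True)
    .sort_values("date")
    .reset_index(drop=True)
)
print(timeline)
```

Passing an explicit format makes pandas raise an error on values that do not match, which surfaces inconsistent dates early instead of at training time.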
The Schema for such a data structure might look as follows:
from aindo.rdml.relational import Column, ForeignKey, PrimaryKey, Schema, Table

schema = Schema(
    patient=Table(
        patient_id=PrimaryKey(),
        birthdate=Column.DATE,
        gender=Column.CATEGORICAL,
        ...
    ),
    visits=Table(
        patient_id=ForeignKey(parent="patient"),
        date=Column.DATETIME,
        visit_type=Column.CATEGORICAL,
        ...
    ),
    diagnosis=Table(
        patient_id=ForeignKey(parent="patient"),
        date=Column.DATETIME,
        code=Column.CATEGORICAL,
        description=Column.TEXT,
        severity=Column.CATEGORICAL,
        ...
    ),
    procedures=Table(
        patient_id=ForeignKey(parent="patient"),
        date=Column.DATETIME,
        type=Column.CATEGORICAL,
        outcome=Column.CATEGORICAL,
        ...
    ),
    discharge=Table(
        patient_id=ForeignKey(parent="patient"),
        date=Column.DATETIME,
        disposition=Column.CATEGORICAL,
        status=Column.CATEGORICAL,
        medications_prescribed=Column.CATEGORICAL,
        notes=Column.TEXT,
        ...
    ),
)
Once the data structure is defined, preprocessing works similarly to preprocessing relational tabular data. This is handled by the EventPreproc object. To instantiate the default event preprocessor, use the EventPreproc.from_schema() class method, providing the data Schema.
The EventPreproc.from_schema() method has the following optional arguments:
- ord_cols: A dictionary mapping table names to their respective order columns. Tables not listed are assumed not to have an order column. Final tables do not need order columns. As noted, if only one non-final table exists, it may omit the order column.
- final_tables: A collection of table names that contain final events.
- ctx_cols: A sequence of root table columns to be used as context. These columns will not be learned by the model and must be provided during generation.
- preprocessors: Equivalent to the preprocessors argument in TabularPreproc.from_schema(). See this section for a detailed explanation of preprocessing directives.
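The rules for ord_cols and final_tables can be encoded in a small check to run before calling EventPreproc.from_schema(). This is a hypothetical helper written in plain Python, not a function of aindo.rdml, and the library may perform its own validation.

```python
def missing_order_columns(event_tables, ord_cols=None, final_tables=()):
    """Return the non-final tables that still need an order column (hypothetical helper)."""
    ord_cols = ord_cols or {}
    # Final tables never need an order column: their events always come last.
    non_final = [t for t in event_tables if t not in final_tables]
    # A single non-final table may omit the order column: the original row order is used.
    if len(non_final) == 1:
        return []
    return [t for t in non_final if t not in ord_cols]


event_tables = ["visits", "diagnosis", "procedures", "discharge"]
ord_cols = {"visits": "date", "diagnosis": "date"}

print(missing_order_columns(event_tables, ord_cols, final_tables=["discharge"]))
# → ['procedures']
```

Here the procedures table is flagged because it is non-final, is not the only non-final table, and has no entry in ord_cols.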
Warning
The preprocessing of the order columns included in ord_cols cannot be customized with the preprocessors argument.
Just like the TabularPreproc, the EventPreproc must be fitted on a RelationalData object.
import pandas as pd

from aindo.rdml.relational import Column, ForeignKey, PrimaryKey, RelationalData, Schema, Table
from aindo.rdml.synth.event import EventPreproc

# Declare the root table and the event tables that reference it.
schema = Schema(
    patient=Table(
        patient_id=PrimaryKey(),
        ...
    ),
    visits=Table(
        patient_id=ForeignKey(parent="patient"),
        date=Column.DATETIME,
        ...
    ),
    ...
)

# Load one DataFrame per table, keyed by the table name used in the schema.
data = {
    "patient": pd.read_csv("path/to/patient"),
    "visits": pd.read_csv("path/to/visits"),
    ...
}
data = RelationalData(data=data, schema=schema)

# Build the default event preprocessor, then fit it on the data.
preproc = EventPreproc.from_schema(
    schema=data.schema,
    ord_cols={"visits": "date", "diagnosis": "date", ...},
    final_tables=["discharge"],
)
preproc.fit(data=data)