Data preprocessing

As discussed in the introduction, event data must adhere to a specific structure:

  • Root Table: A table with a primary key and feature columns containing global information for each time series. In our example, the patient table holds the personal data of inpatients.

  • Event Tables: One or more tables with a single foreign key referencing the root table. In our case, the visits, diagnosis, procedures, and discharge tables all reference the patient table via a foreign key.

  • Final Event Tables (optional): Some event tables may be declared as final. A final table contains the last event for each time series, meaning only one row per foreign key value is allowed. For example, the discharge table is a final table, as each patient can only be discharged once, and no further medical events should occur afterward.

  • Order Columns: Each event table may have an order column that defines the sequence of events in the time series. Order columns can be either numerical or datetime fields. Final tables do not require an order column, since their events are by definition the last in the series. Non-final tables, by contrast, should have an order column so that the event sequence is well defined. The one exception is a dataset with a single non-final event table: if no order column is provided for it, the original row order is used. In our example, all event tables include a date column, which serves as the order column.
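The structural requirements listed above can be checked on the raw tables before any schema is built. The following is a minimal sketch using only pandas; the helper name `check_event_structure` and the toy data are hypothetical and are not part of the aindo.rdml API.

```python
import pandas as pd


def check_event_structure(root, events, key, final_tables=()):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    # The root table's primary key must identify each time series uniquely.
    if root[key].duplicated().any():
        problems.append("root: duplicate primary key values")
    root_keys = set(root[key])
    for name, table in events.items():
        # Every event must reference an existing row of the root table.
        if not set(table[key]).issubset(root_keys):
            problems.append(f"{name}: foreign key values missing from root")
        # A final table may hold at most one row per foreign key value.
        if name in final_tables and table[key].duplicated().any():
            problems.append(f"{name}: more than one final event per key")
    return problems


# Toy data (hypothetical): patient 1 is discharged twice, which is not allowed.
patient = pd.DataFrame({"patient_id": [1, 2], "gender": ["F", "M"]})
visits = pd.DataFrame(
    {"patient_id": [1, 1, 2], "date": ["2024-01-01", "2024-01-03", "2024-02-01"]}
)
discharge = pd.DataFrame(
    {"patient_id": [1, 1], "date": ["2024-01-05", "2024-01-06"]}
)

problems = check_event_structure(
    root=patient,
    events={"visits": visits, "discharge": discharge},
    key="patient_id",
    final_tables=["discharge"],
)
print(problems)  # ['discharge: more than one final event per key']
```

Running a check of this kind first tends to give clearer error messages than discovering a malformed table during fitting.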

Warning

All order columns in the dataset must be consistent with one another. For example, they should all be dates that share the same format.
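One way to satisfy this requirement is to parse every order column into a proper datetime type before building the dataset. The sketch below uses only pandas; the helper `normalize_order_columns`, the per-table format strings, and the toy data are assumptions made for illustration.

```python
import pandas as pd


def normalize_order_columns(tables, ord_cols, formats):
    """Return copies of the tables with each order column parsed as datetime."""
    out = {}
    for name, table in tables.items():
        table = table.copy()
        if name in ord_cols:
            col = ord_cols[name]
            # With an explicit format, values that do not match raise an error
            # instead of being silently misparsed.
            table[col] = pd.to_datetime(table[col], format=formats[name])
        out[name] = table
    return out


# Toy data (hypothetical): the two tables store dates in different formats.
visits = pd.DataFrame({"patient_id": [1, 2], "date": ["2024-01-03", "2024-02-01"]})
diagnosis = pd.DataFrame({"patient_id": [1], "date": ["04/01/2024"]})

tables = normalize_order_columns(
    tables={"visits": visits, "diagnosis": diagnosis},
    ord_cols={"visits": "date", "diagnosis": "date"},
    formats={"visits": "%Y-%m-%d", "diagnosis": "%d/%m/%Y"},
)
for name, table in tables.items():
    print(name, table["date"].dtype)
```

After this step both date columns hold comparable timestamps, so events from different tables can be ordered against each other.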

For instance, the Schema of such a data structure may be:

from aindo.rdml.relational import Column, ForeignKey, PrimaryKey, Schema, Table

schema = Schema(
    patient=Table(
        patient_id=PrimaryKey(),
        birthdate=Column.DATE,
        gender=Column.CATEGORICAL,
        ...
    ),
    visits=Table(
        patient_id=ForeignKey(parent="patient"),
        date=Column.DATETIME,
        visit_type=Column.CATEGORICAL,
        ...
    ),
    diagnosis=Table(
        patient_id=ForeignKey(parent="patient"),
        date=Column.DATETIME,
        code=Column.CATEGORICAL,
        description=Column.TEXT,
        severity=Column.CATEGORICAL,
        ...
    ),
    procedures=Table(
        patient_id=ForeignKey(parent="patient"),
        date=Column.DATETIME,
        type=Column.CATEGORICAL,
        outcome=Column.CATEGORICAL,
        ...
    ),
    discharge=Table(
        patient_id=ForeignKey(parent="patient"),
        date=Column.DATETIME,
        disposition=Column.CATEGORICAL,
        status=Column.CATEGORICAL,
        medications_prescribed=Column.CATEGORICAL,
        notes=Column.TEXT,
        ...
    ),
)

Once the data structure is defined, preprocessing works much as it does for relational tabular data, and is handled by the EventPreproc object.

To instantiate the default event preprocessor, use the EventPreproc.from_schema() class method, providing the data Schema.

The EventPreproc.from_schema() method has the following optional arguments:

  • ord_cols: A dictionary mapping table names to their respective order columns. Tables not listed are assumed not to have an order column. Final tables do not need order columns. As noted, if only one non-final table exists, it may omit the order column.
  • final_tables: A collection of table names that contain final events.
  • ctx_cols: A sequence of root table columns to be used as context. These columns will not be learned by the model and must be provided during generation.
  • preprocessors: Equivalent to the preprocessors argument in TabularPreproc.from_schema(). See this section for a detailed explanation of preprocessing directives.
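To make the roles of ord_cols and final_tables concrete, the following sketch shows the event ordering they describe: within each time series, non-final events are sorted by their order column and final events come last. It is a conceptual illustration written with pandas only, not how EventPreproc works internally; the helper `build_sequences` and the toy data are hypothetical.

```python
import pandas as pd


def build_sequences(events, key, ord_cols, final_tables=()):
    """Stack event tables into one frame, ordered within each time series."""
    frames = []
    for name, table in events.items():
        is_final = name in final_tables
        frame = pd.DataFrame({key: table[key].to_numpy(), "table": name})
        frame["is_final"] = is_final
        # Final tables need no order column: they are placed last regardless.
        frame["order"] = pd.NaT if is_final else table[ord_cols[name]].to_numpy()
        frames.append(frame)
    stacked = pd.concat(frames, ignore_index=True)
    # Sort by time series, then non-final before final, then by order value.
    return stacked.sort_values([key, "is_final", "order"]).reset_index(drop=True)


# Toy data (hypothetical); the final table carries no order column at all.
visits = pd.DataFrame(
    {
        "patient_id": [1, 1, 2],
        "date": pd.to_datetime(["2024-01-03", "2024-01-01", "2024-02-01"]),
    }
)
diagnosis = pd.DataFrame(
    {"patient_id": [1], "date": pd.to_datetime(["2024-01-02"])}
)
discharge = pd.DataFrame({"patient_id": [1], "status": ["recovered"]})

seq = build_sequences(
    events={"visits": visits, "diagnosis": diagnosis, "discharge": discharge},
    key="patient_id",
    ord_cols={"visits": "date", "diagnosis": "date"},
    final_tables=["discharge"],
)
print(seq["table"].tolist())
# ['visits', 'diagnosis', 'visits', 'discharge', 'visits']
```

For patient 1 the two visits and the diagnosis are interleaved by date, and the discharge closes the series even though it has no order value.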

Warning

The preprocessing of the order columns included in ord_cols cannot be customized with the preprocessors argument.

Just like the TabularPreproc, the EventPreproc must be fitted on a RelationalData object.

import pandas as pd

from aindo.rdml.relational import Column, ForeignKey, PrimaryKey, RelationalData, Schema, Table
from aindo.rdml.synth.event import EventPreproc

schema = Schema(
    patient=Table(
        patient_id=PrimaryKey(),
        ...
    ),
    visits=Table(
        patient_id=ForeignKey(parent="patient"),
        date=Column.DATETIME,
        ...
    ),
    ...
)
data = {
    "patient": pd.read_csv("path/to/patient"),
    "visits": pd.read_csv("path/to/visits"),
    ...
}

data = RelationalData(data=data, schema=schema)
preproc = EventPreproc.from_schema(
    schema=data.schema,
    ord_cols={"visits": "date", "diagnosis": "date", ...},
    final_tables=["discharge"],
)
preproc.fit(data=data)

Info

Internally, the Schema used by the event preprocessor differs from the one provided by the user. This internal schema includes adjustments for managing order columns. However, the generated data will always conform to the original Schema supplied by the user.