Data pre- and post-processing

The main strength of the aindo.rdml.synth.llm package is that it provides classes to preprocess relational datasets so that they can be fed to an LLM, both for training and inference.

The underlying idea is to use the structured generation (also known as guided generation) feature of LLM inference to make the LLM generate text that can be interpreted as relational data. In particular, we make use of the JSON data format since:

  • A wide class of relational data structures can be represented in JSON format.
  • It is one of the most common formats for LLM structured generation.
Restriction on the relational data structure

To effectively transform a relational dataset into JSON format, we restrict ourselves to relational datasets with a tree structure. In other words, every table to be generated may have at most one foreign key referring to another table among those to be generated (excluding, for instance, lookup tables). Under this hypothesis, it is possible to split the dataset into independent, identically distributed samples, one for each row of the root table, which is usually the table containing each individual's information.

The preprocessing of aindo.rdml.synth.llm transforms the relational data into a list of JSON objects (as Python dictionaries), one for each independent sample of the dataset. These JSON objects can be serialized to text and used to train an LLM, which can learn their underlying structure. Given a relational dataset, the aindo.rdml.synth.llm package also provides the tools to build the JSON schema describing a valid JSON object for a single sample of the dataset. This JSON schema can then be passed to any LLM inference framework that supports structured JSON generation, in order to force the LLM to generate valid JSON objects, which in turn represent valid dataset samples. The post-processing routines finally take the JSON outputs of the LLM and reconstruct the relational data structure.
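For instance, for a hypothetical two-table dataset with a players root table and a dependent season table, a single sample might be serialized along the following lines (the field names and nesting are purely illustrative, and the exact layout produced by the package may differ):

# A single sample: one row of the root "players" table, with the rows of
# the dependent "season" table nested inside it as a JSON array.
sample = {
    "name": "John Doe",
    "college": "Example College",
    "season": [
        {"year": 1999, "points": 1204},
        {"year": 2000, "points": 987},
    ],
}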

To start, the user must define a preprocessor, choosing the type according to the task at hand. There are two available preprocessors:

Relational data

An instance of RelSynthPreproc can be built with the RelSynthPreproc.from_data() class method. The preprocessor is built using the Schema of the input RelationalData and then fitted to the actual data. Alternatively, the RelSynthPreproc.from_schema() class method can be used to initialize the preprocessor from a Schema, which is later fitted with the RelSynthPreproc.fit() method.

The RelSynthPreproc.from_schema() (respectively RelSynthPreproc.from_data()) class method takes as input a Schema (respectively RelationalData), which is used to build (and fit) a TablePreproc for each dataset table. In turn, each TablePreproc contains a ColumnPreproc for each supported table column. The default type of ColumnPreproc is chosen according to the Column type declared in the Schema.

It is possible to customize the initialization of the RelSynthPreproc with two optional arguments (both for RelSynthPreproc.from_data() and for RelSynthPreproc.from_schema()):

  • config: A dictionary of TableConfig, containing optional table preprocessor configurations for the dataset tables (a sketch follows this list). A TableConfig contains:

    • description: An optional description of the table, used in the JSON schema.
    • column_descriptions: A dictionary of optional descriptions for the table columns, also used in the JSON schema.
    • config: An optional pydantic.ConfigDict used to configure the resulting JSON schema.
  • overwrites: A dictionary of TablePreproc to overwrite the default one.
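As a minimal sketch of the config argument (assuming TableConfig is importable from aindo.rdml.synth.llm, and that data is the RelationalData instance built in the example at the end of this section):

from aindo.rdml.synth.llm import RelSynthPreproc, TableConfig

# Sketch: attach optional table and column descriptions, which are then
# included in the generated JSON schema.
preproc = RelSynthPreproc.from_data(
    data=data,
    config={
        "players": TableConfig(
            description="Personal information of each basketball player",
            column_descriptions={"college": "The college attended by the player"},
        ),
    },
)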

When providing a custom table preprocessor via the overwrites parameter, keep in mind that TablePreproc objects can in turn be built with the TablePreproc.from_table() class method, starting from a Table object. This method also has an optional config parameter, and an overwrites parameter that can be used to specify custom ColumnPreproc instances for specific columns. Given a Column type, the default ColumnPreproc type for that Column can be obtained with the TablePreproc.get_default_column() class method.

Each column preprocessor can be initialized with the following arguments (a sketch follows this list):

  • name: The name of the column (used in the JSON schema).
  • description: The description of the column (used in the JSON schema).
  • special_values: A sequence of values to be treated separately as categories (not available for Categorical).
  • protection: A Protection configuration with an optional list of Detectors and a default option to enable the default Detectors. Optionally, a boolean flag can be provided to the protection parameter in place of the Protection configuration, to indicate whether the default protection should be enabled (short for Protection(default=True)).
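For instance, a custom Numeric preprocessor (described in the list below) might be configured as follows; this is a sketch, which assumes Numeric is importable from aindo.rdml.synth.llm like the other classes on this page:

from aindo.rdml.synth.llm import Numeric, Protection

# Sketch of a custom column preprocessor with the common arguments.
preproc_col = Numeric(
    name="points",
    description="Total points scored by the player in the season",
    special_values=[0],  # treat 0 as a separate category
    protection=Protection(default=True),  # or simply protection=True
)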

The available column preprocessors are the following:

  • Categorical: A preprocessor for categorical columns.

  • Boolean: A preprocessor for boolean columns.

  • Coordinates: A preprocessor for columns containing geographic coordinates. It takes as optional arguments cfg_lat and cfg_lon, both accepting an optional NumCfg configuration. The latter includes configurable lower_bound, upper_bound, and multiple_of options, which are used to build the JSON schema. Lower and upper bounds can either be provided explicitly or inferred directly from the data.

  • DateTime: A preprocessor for columns with datetime data. The optional fmt parameter can be used to explicitly set the datetime format; otherwise, it will be inferred from the data.

  • Date: A preprocessor for datetime columns containing only dates.

  • Time: A preprocessor for datetime columns containing only times.

  • Numeric: A preprocessor for columns with numeric values. The user can provide a NumCfg to the optional cfg parameter.

  • Integer: A preprocessor for columns with integer values. It supports the same optional cfg parameter as the Numeric preprocessor.

  • Text: A preprocessor for columns with free text. The optional cfg parameter takes an optional TextCfg configuration. The latter contains min_length and max_length options, to regulate the length of the generated text, and a pattern option to constrain the generated text with a regular expression (a sketch follows this list).
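A minimal sketch of the NumCfg and TextCfg configurations, assuming both are importable from aindo.rdml.synth.llm:

from aindo.rdml.synth.llm import Integer, NumCfg, Text, TextCfg

# Sketch: bound an integer column and limit the length of a text column.
preproc_age = Integer(cfg=NumCfg(lower_bound=0, upper_bound=120))
preproc_bio = Text(cfg=TextCfg(min_length=1, max_length=200))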

In the following example, we initialize a RelSynthPreproc and fit it to a relational dataset. Once again, we use the BasketballMen dataset; a full example based on the same dataset is presented in the examples section.

from aindo.rdml.relational import RelationalData, Schema, Table
from aindo.rdml.synth.llm import RelSynthPreproc

data = RelationalData(
    data={"players": ..., "season": ..., "all_star": ...},
    schema=Schema(
        players=Table(...),
        season=Table(...),
        all_star=Table(...),
    ),
)
preproc = RelSynthPreproc.from_data(data=data)

Event data

A RelEventPreproc is used for event data, and is built in the same way as the RelSynthPreproc, with the RelEventPreproc.from_data() class method, or using RelEventPreproc.from_schema() and then RelEventPreproc.fit().

These methods have the same config and overwrites optional parameters as their counterparts in RelSynthPreproc. There is one additional parameter, ord_cols, which must be used to select the order column of each event table. For more information on order columns, check the event data preprocessing section. The ord_cols parameter has the same meaning as the parameter of the same name in the EventPreproc.from_schema() method.

Warning

Notice that, unlike with the EventPreproc object, it is not possible to specify final tables, namely tables whose events are always last in each individual's time series (for more details, see the event data section).
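A minimal sketch of building an event preprocessor, assuming the BasketballMen data from the previous example and a hypothetical year column acting as the order column of both event tables (the exact format expected by ord_cols may differ):

from aindo.rdml.synth.llm import RelEventPreproc

# Sketch: build the event preprocessor from the data, selecting an order
# column for each event table. The "year" columns are hypothetical.
preproc = RelEventPreproc.from_data(
    data=data,
    ord_cols={"season": "year", "all_star": "year"},
)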

Column data protection

As with the preprocessing for neural models, it is possible to add an extra layer of privacy protection by protecting the rare values in the column data.

This is achieved by passing a Protection configuration or a boolean flag to the protection parameter of a custom ColumnPreproc.

To protect the rare values of some columns, it is necessary to instantiate the custom column preprocessors and then build the RelSynthPreproc (or the RelEventPreproc) using the overwrites parameter. The data can then be protected with the RelSynthPreproc.protect() method, which masks all the detected rare values. The preprocessor should then be fitted on the protected data.

In the following example, we turn on the default protection for the college column in the players table, while keeping the default column preprocessor type:

from aindo.rdml.relational import RelationalData, Schema, Table, Column
from aindo.rdml.synth.llm import RelSynthPreproc, TablePreproc, Protection

data = RelationalData(
    data={"players": ..., "season": ..., "all_star": ...},
    schema=Schema(
        players=Table(college=Column.CATEGORICAL, ...),
        season=Table(...),
        all_star=Table(...),
    ),
)
table = data.schema.tables["players"]
column_type = TablePreproc.get_default_column(table.columns["college"])
preproc_col = column_type(protection=Protection(default=True))
preproc_table = TablePreproc.from_table(
    table=table,
    overwrites={"college": preproc_col},
)
preproc = RelSynthPreproc.from_schema(
    schema=data.schema,
    overwrites={"players": preproc_table},
)
data_protected = preproc.protect(data=data)
preproc.fit(data=data_protected)

Info

Notice that, unlike the Protection configuration for standard neural models, the Protection configuration for LLM generation has no type option. The choice of masking or imputing the protected values is postponed to the generation phase, where it is controlled by the column generation directive ColumnDirective.