Data pre- and post-processing
The main strength of the aindo.rdml.synth.llm package is that it provides classes
to preprocess relational datasets and feed them to an LLM, both for training and inference.
The underlying idea is to use the structured generation (also known as guided generation) feature of LLM inference to make the LLM generate text that can be interpreted as relational data. In particular, we make use of the JSON data format since:
- A wide class of relational data structures can be represented in JSON format.
- It is one of the most common formats for LLM structured generation.
Restriction on the relational data structure
To effectively transform a relational dataset to JSON format, we restrict ourselves to relational datasets that have a tree structure. In other words, every table to be generated may have at most one foreign key referring to another table among those to be generated (excluding, for instance, lookup tables). Under this hypothesis, it is possible to split the dataset into independent, identically distributed samples, one for each row of the root table, which is usually the table containing each individual's information.
The preprocessing of aindo.rdml.synth.llm transforms the relational data into a list
of JSON objects (as Python dictionaries), one for each independent sample of the dataset.
These JSON objects can be serialized to text, and used to train an LLM, which can learn their underlying structure.
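To illustrate the idea (this is a stdlib-only sketch of the concept, not the package API, and the table and column names are hypothetical), the following code nests a toy tree-structured dataset, a players root table with a season child table, into one JSON-serializable object per root row:

```python
import json

# Toy tree-structured dataset: "season" has a single foreign key to "players".
players = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
season = [
    {"player_id": 1, "year": 2020, "points": 310},
    {"player_id": 1, "year": 2021, "points": 280},
    {"player_id": 2, "year": 2020, "points": 150},
]


def nest(players: list[dict], season: list[dict]) -> list[dict]:
    """Build one JSON-serializable dict per root-table row."""
    samples = []
    for p in players:
        # Collect the child rows referring to this root row,
        # dropping the foreign key (it is implied by the nesting).
        children = [
            {k: v for k, v in s.items() if k != "player_id"}
            for s in season
            if s["player_id"] == p["id"]
        ]
        samples.append({"name": p["name"], "season": children})
    return samples


samples = nest(players, season)
# Each sample can be serialized independently and used as LLM training text.
print(json.dumps(samples[0]))
```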
Given a relational dataset, the aindo.rdml.synth.llm package also provides the tools
to build the JSON schema that describes a valid JSON object for a single sample of
the dataset.
This JSON schema can then be passed to any LLM inference framework that supports structured JSON generation,
in order to force the LLM to generate valid JSON objects, which in turn represent valid dataset samples.
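For the toy players/season structure above, a per-sample schema might look like the following. This is a hand-written illustration using standard JSON Schema keywords, not the schema emitted by the package:

```python
import json

# Hand-written JSON schema for a single sample of a toy players/season
# dataset (illustrative only, not the schema produced by aindo.rdml).
sample_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "season": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "year": {"type": "integer"},
                    "points": {"type": "integer", "minimum": 0},
                },
                "required": ["year", "points"],
            },
        },
    },
    "required": ["name", "season"],
}

# Passed to an inference framework that supports structured JSON generation,
# a schema like this constrains the LLM to emit only valid samples.
print(json.dumps(sample_schema, indent=2))
```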
The post-processing routines finally take the JSON outputs of the LLM and reconstruct the relational data structure.
To start, the user must define a preprocessor, choosing the type according to the task at hand. There are two available preprocessors:
- RelSynthPreproc: deals with relational data.
- RelEventPreproc: deals with event data (see this section for details on event data).
Relational data
An instance of RelSynthPreproc can be built with the
RelSynthPreproc.from_data() class method.
The preprocessor is built using the Schema of the input
RelationalData and then fitted to the actual data.
It is also possible to use separately the
RelSynthPreproc.from_schema() class method to initialize
the preprocessor from a Schema, and later fit it with the
RelSynthPreproc.fit() method.
The RelSynthPreproc.from_schema()
(respectively RelSynthPreproc.from_data()) class method
takes as input a Schema
(respectively RelationalData), which is used to build (and fit) a
TablePreproc for each dataset table.
In turn, each TablePreproc contains a
ColumnPreproc for each supported table column.
The default type of ColumnPreproc is chosen according to the
Column type declared in the Schema.
It is possible to customize the initialization of the RelSynthPreproc
with two optional arguments (both for RelSynthPreproc.from_data()
and for RelSynthPreproc.from_schema()):
- config: A dictionary of TableConfig, containing optional table preprocessor configurations for the dataset tables. A TableConfig contains:
  - description: An optional description of the table, used in the JSON schema.
  - column_descriptions: A dictionary of optional descriptions for the table columns, also used in the JSON schema.
  - config: A dictionary of optional pydantic.ConfigDict to configure the resulting JSON schema.
- overwrites: A dictionary of TablePreproc to overwrite the default ones.
When providing a custom table preprocessor with the overwrites parameter, keep in mind that
TablePreproc objects can in turn be built with the
TablePreproc.from_table() class method,
starting from a Table object.
TablePreproc.from_table() also has an optional config parameter, and an overwrites parameter, which can be used to
specify custom ColumnPreproc instances for specific columns.
Given a Column type, the default ColumnPreproc
type for that Column can be obtained with the
TablePreproc.get_default_column() class method.
Each column preprocessor can be initialized with:
- name: The name of the column (used in the JSON schema).
- description: The description of the column (used in the JSON schema).
- special_values: A sequence of values to be treated separately as categories (not available for Categorical).
- protection: A Protection configuration with an optional list of Detector's and a default option to enable the default Detector's. Optionally, a boolean flag can be provided to the protection parameter in place of the Protection configuration, to indicate whether the default protection should be enabled (short for Protection(default=True)).
The available column preprocessors are the following:
- Categorical: A preprocessor for categorical columns.
- Boolean: A preprocessor for boolean columns.
- Coordinates: A preprocessor for columns containing geographic coordinates. It takes as optional arguments cfg_lat and cfg_lon, both accepting an optional NumCfg configuration. The latter includes configurable lower_bound, upper_bound and multiple_of options, which are used to build the JSON schema. Lower and upper bounds can be provided explicitly, or inferred directly from the data.
- DateTime: A preprocessor for columns with datetime data. The optional fmt parameter can be used to explicitly set the datetime format, otherwise it will be inferred from the data.
- Date: A datetime preprocessor for columns containing only dates.
- Time: A datetime preprocessor for columns containing only times.
- Numeric: A preprocessor for columns with numeric values. The user can provide a NumCfg to the optional cfg parameter.
- Integer: A preprocessor for columns with integer values. It supports the same cfg parameter as the Numeric preprocessor.
- Text: A preprocessor for columns with free text. The optional cfg parameter takes an optional TextCfg configuration. The latter contains min_length and max_length options, to regulate the length of the generated text, and a pattern option to constrain the generated text with a regular expression.
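The numeric and text options described above map naturally onto standard JSON Schema keywords. The sketch below shows the kind of constraints such a schema can carry; the fragments are hand-written illustrations (with hypothetical column values), not the package's output:

```python
import re

# Illustrative JSON Schema fragments (hand-written, not the package output).
# NumCfg-style bounds map onto minimum / maximum / multipleOf:
price_schema = {"type": "number", "minimum": 0.0, "maximum": 1e6, "multipleOf": 0.01}
# TextCfg-style options map onto minLength / maxLength / pattern:
zip_schema = {"type": "string", "minLength": 5, "maxLength": 5, "pattern": r"^\d{5}$"}


def check_zip(value: str) -> bool:
    """Mimic the length/pattern constraint a structured decoder would enforce."""
    return (
        zip_schema["minLength"] <= len(value) <= zip_schema["maxLength"]
        and re.fullmatch(zip_schema["pattern"], value) is not None
    )


print(check_zip("34100"))  # a well-formed 5-digit value
print(check_zip("3410a"))  # rejected by the pattern
```

During structured generation, the inference framework enforces such constraints at decoding time, so the LLM cannot emit a value outside the configured bounds or pattern.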
In the following example, we initialize a RelSynthPreproc
and fit it to a relational dataset.
Once again, we use the BasketballMen dataset as an example.
In the example section we also present a full example
using the same dataset.
from aindo.rdml.relational import RelationalData, Schema, Table
from aindo.rdml.synth.llm import RelSynthPreproc
data = RelationalData(
data={"players": ..., "season": ..., "all_star": ...},
schema=Schema(
players=Table(...),
season=Table(...),
all_star=Table(...),
),
)
preproc = RelSynthPreproc.from_data(data=data)
Event data
A RelEventPreproc is used for event data,
and is built in the same way as the RelSynthPreproc, with the
RelEventPreproc.from_data() class method, or using
RelEventPreproc.from_schema() and then
RelEventPreproc.fit().
These methods have the same config and overwrites optional parameters as their corresponding ones in
RelSynthPreproc.
There is one additional parameter, ord_cols, which must be used to select the order column of each event table.
For more information on order columns, check the event data preprocessing section.
The ord_cols parameter has the same meaning as the parameter of the same name in the
EventPreproc.from_schema() method.
Warning
Notice that, unlike with the EventPreproc object,
it is not possible to specify final tables, namely tables whose events are always last in each individual's
time series (for more details, see the event data section).
Column data protection
As with the preprocessing for neural models, it is possible to add an extra layer of privacy protection by protecting the rare values in the column data.
This is achieved by passing a Protection configuration or a boolean flag to the
protection parameter of a custom ColumnPreproc.
To protect the rare values of some columns it is necessary to instantiate the custom column preprocessors,
and then build the RelSynthPreproc
(or the RelEventPreproc) using the overwrites parameter.
Then the data can be protected with the RelSynthPreproc.protect()
method, which masks all the detected rare values.
The preprocessor should then be fitted with the protected data.
In the following example, we turn on the default protection for the college column in the players table,
while keeping the default column preprocessor type:
from aindo.rdml.relational import RelationalData, Schema, Table, Column
from aindo.rdml.synth.llm import RelSynthPreproc, TablePreproc, Protection
data = RelationalData(
data={"players": ..., "season": ..., "all_star": ...},
schema=Schema(
players=Table(college=Column.CATEGORICAL, ...),
season=Table(...),
all_star=Table(...),
),
)
table = data.schema.tables["players"]
column_type = TablePreproc.get_default_column(table.columns["college"])
preproc_col = column_type(protection=Protection(default=True))
preproc_table = TablePreproc.from_table(
table=table,
overwrites={"college": preproc_col},
)
preproc = RelSynthPreproc.from_schema(
schema=data.schema,
overwrites={"players": preproc_table},
)
data_protected = preproc.protect(data=data)
preproc.fit(data=data_protected)
Info
Notice that, unlike the Protection configuration for standard neural models,
the Protection configuration for LLM generation does not have a
type option.
The choice of masking or imputing the protected values is postponed to the generation phase,
and is performed with the column generation directive ColumnDirective.