Tasks

The aindo.rdml.synth.llm module allows the user to tackle several tasks. The list of available tasks can be found in the Task enumeration.

For each Task element, there is a BaseTask object available, which provides the necessary pre- and post-processing utilities. All BaseTask objects are built starting from one of the preprocessors described in the preprocessing section. Optionally, the user can provide a description, which can be used to build the prompts for training or inference. Each BaseTask object provides the following methods:

get_dataset(): Given a RelationalData as input, it returns an iterator of DatasetElem, one for each data sample. A DatasetElem contains the data to build a training example to fine-tune an LLM, including the JSON schema of the full relational dataset, the context data in JSON format, the target data in JSON format, and the JSON schema of the target data.
prompt(): This method returns an iterator of RelGenPrompt. Each RelGenPrompt contains the data to build a prompt for a single dataset sample. It includes the JSON schema of the full relational dataset, the context data in JSON format, the JSON schema of the target data, and a pydantic.BaseModel that can be used to validate the output data obtained from the LLM.
inverse_transform(): This routine can be used to transform back the output of the LLM into the original relational data format.
generate(): This method essentially combines the prompt() method, the LLM generation with a Generator and a provided Engine (see the section about LLM generation for further details), and the inverse_transform() method. For all tasks, the generate() method takes as input:
- cfg_gen: The generation configuration, an instance of GenConfig, which is used to set up the Generator and perform the generation with the Generator.generate() method. All the parameters of GenConfig have the same meaning as the parameters used in the Generator and detailed in the generation section.
- ctx: The context data, which may have different uses depending on the task at hand.
- output_dir: Optional output directory for generation logs.
- engine_args: Optional CLI arguments to be passed to the generation engine.
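
The way generate() chains the other three steps can be pictured with a minimal, library-free sketch. Everything below (build_prompts, fake_llm, to_rows, generate) is a hypothetical stand-in and not part of aindo.rdml: the "LLM" is a stub that always returns the same JSON document, so that the flow can be run end to end.

```python
import json
from typing import Callable, Iterator


def build_prompts(template: str, schema: dict, contexts: list[dict]) -> Iterator[str]:
    """Stand-in for prompt(): yield one prompt string per context sample."""
    for ctx in contexts:
        yield template.format(schema=json.dumps(schema), ctx=json.dumps(ctx))


def fake_llm(prompt: str) -> str:
    """Stand-in for the LLM call: always returns the same JSON document."""
    return json.dumps({"race": "W"})


def to_rows(contexts: list[dict], outputs: list[str]) -> list[dict]:
    """Stand-in for inverse_transform(): merge each context with its parsed output."""
    return [{**ctx, **json.loads(out)} for ctx, out in zip(contexts, outputs)]


def generate(
    template: str,
    schema: dict,
    contexts: list[dict],
    llm: Callable[[str], str] = fake_llm,
) -> list[dict]:
    """Stand-in for generate(): build the prompts, call the LLM, transform back."""
    prompts = build_prompts(template, schema, contexts)
    outputs = [llm(p) for p in prompts]
    return to_rows(contexts, outputs)


rows = generate(
    template="Data schema: {schema}, context: {ctx}.\nSynthetic data:",
    schema={"type": "object", "properties": {"pos": {}, "race": {}}},
    contexts=[{"pos": "G"}, {"pos": "C"}],
)
```

In the real module, the prompt objects also carry the JSON schemas and a validation model, and the output is a RelationalData rather than a list of dictionaries.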

Synthetic data tasks

Synthetic data tasks are built from a RelSynthPreproc.

Synthetic data generation

The RelSynth task is used to generate synthetic data, either from scratch or with a subset of the root table columns as context.

Beyond the preprocessor and the optional description, the user can optionally specify the context columns by passing the list of column names to the cols_ctx parameter.

The RelSynth.prompt() method takes the following extra arguments:

n_samples: When generating fully synthetic data, with no context needed, it can be used to specify the number of samples (rows of the root table) to generate.
directives: A GenDirective dictionary with the generation directives for each table. The latter are provided as TableDirective objects, which in turn are dictionaries of ColumnDirective objects, containing the generation options for each column:
- impute_na: Whether to impute NA values for that column.
- impute_protected: Whether to impute the protected values, if the protection was active for that column. If False, the synthetic data may contain masked values in place of the rare ones.
- mask: Some extra values to avoid sampling when generating synthetic data. They can be among the categories of a categorical column, or among the declared special_values for the other column types.
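
One way to picture the effect of these options is as constraints on the set of values that may be sampled for a column. The sketch below is only an illustration for a categorical column: the allowed_values helper is hypothetical, and it is not how the library applies the directives internally.

```python
from typing import Optional


def allowed_values(
    categories: list[str],
    impute_na: bool = False,
    mask: frozenset = frozenset(),
) -> list[Optional[str]]:
    """Return the values that may still be sampled for a categorical column.

    None stands for a missing (NA) value.
    """
    # Masked categories are never sampled.
    values: list[Optional[str]] = [c for c in categories if c not in mask]
    if not impute_na:
        # NA stays a legal outcome unless the user asks to impute it.
        values.append(None)
    return values


default_values = allowed_values(["B", "W", "O"])
race_values = allowed_values(["B", "W", "O"], impute_na=True, mask=frozenset({"O"}))
```

With impute_na=True and "O" masked, only "B" and "W" remain available for the column.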

The RelSynth.inverse_transform() method takes as input an optional iterable with the context in JSON format (for example as available in RelGenPrompt.ctx), an iterable with the JSON data generated by the LLM, and an optional context. It returns an instance of RelationalData with the generated synthetic data.

Finally, the RelSynth.generate() method will provide a list with a single element corresponding to the generated synthetic data. Beyond the parameters discussed previously, the user may also provide n_samples and directives, which are passed to the RelSynth.prompt() method.

As an example, we show how to initialize a RelSynth task, and how to use it to generate synthetic data, using a single root table column as context. Moreover, for presentation purposes, we also set up the generation to avoid sampling NA values in the race column of the players table.

import pandas as pd

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth.llm import (
    RelSynthPreproc,
    RelSynth,
    GenConfig,
    GenDirective,
    TableDirective,
    ColumnDirective,
)

data: RelationalData = ...
preproc = RelSynthPreproc.from_data(data=data)
task = RelSynth(preproc=preproc, cols_ctx=["pos"])
cfg_gen = GenConfig(
    model="your-favourite-model",  # optionally fine-tuned on the synthetic data generation task
    prompt_template="Data schema: {schema}, context: {ctx}.\nSynthetic data:",
    # may depend on the model choice (available special tokens) and the task at hand
)
data_synth, = task.generate(
    cfg_gen=cfg_gen,
    ctx={"players": pd.DataFrame({"pos": pd.Series(...)})},  # use `n_samples` if there is no context
    directives=GenDirective({
        "players": TableDirective({
            "race": ColumnDirective(impute_na=True)
        })
    }),
)

The same result can be obtained using the RelSynth.prompt() and the RelSynth.inverse_transform() methods, by explicitly taking care of the generation process. The latter part can be done with the tools provided by aindo.rdml.synth.llm, which are described in the section about generation.

import pandas as pd

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth.llm import (
    RelSynthPreproc,
    RelSynth,
    Generator,
    Engine,
    GenDirective,
    TableDirective,
    ColumnDirective,
)

data: RelationalData = ...
preproc = RelSynthPreproc.from_data(data=data)
task = RelSynth(preproc=preproc, cols_ctx=["pos"])

# Get the data needed to build the prompts.
ctx = {"players": pd.DataFrame({"pos": pd.Series(...)})}
# Materialize the prompts: they are used again below for the inverse transform.
prompts = list(task.prompt(
    ctx=ctx,
    directives=GenDirective({
        "players": TableDirective({
            "race": ColumnDirective(impute_na=True)
        })
    }),
))

# Generate the data, using an LLM. For example, you can use a `Generator` object.
generator = Generator.from_engine(
    engine=Engine.VLLM,
    model="your-favourite-model",
)
data_valid, _ = generator.generate(
    prompt_template="Data schema: {schema}, context: {ctx}.\nSynthetic data:",
    prompts=prompts,
    n=1,  # one generation per prompt
)

# Inverse transform
x = [p.ctx for p, d in zip(prompts, data_valid, strict=True) if d]  # get the ctx used in the prompts
data_valid = [d[0] for d in data_valid if d]  # there is one generated output per prompt
data_synth = task.inverse_transform(
    x=x,
    y=data_valid,
    ctx=ctx,
    progress=len(data_valid),
)

Semisynthetic data generation

With the RelSemiSynth task it is possible to generate synthetic data by injecting a random portion of the real data as context for the generation.

The more of the original data is used as context, the higher the similarity between the original and the synthetic data. However, injecting too much of the original data may degrade the privacy protection guarantees. The user can obtain the desired similarity-privacy balance by tuning the amount of real data used as context.

When instantiating a RelSemiSynth object, it is necessary to specify the following parameters:

p_field: A float between 0 and 1, indicating the probability that for each row, any given column field is included in the context.
p_child: A float between 0 and 1, indicating the probability that for each row, all the rows of any given child table with a foreign key referring to that row, are included in the context. If they are, their fields are then selected for the context using p_field. Otherwise, all the child rows are generated, together with their own child rows, and so on. In other words, the whole relational structure pertaining to that original row will be fully synthetic.

Optionally the user can also specify:

ctx_as_const: Whether to insert the selected context in the target JSON schema, as constant values.
rng_train: A numpy.random.Generator or an integer seed to fix the randomness of the context selection during training.
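
The roles of p_field and p_child can be illustrated with a small, library-free sketch of the context selection for a single root row. The sample_context function and the dictionary layout are hypothetical, and for brevity the sketch handles only one level of child tables, whereas the description above applies recursively to the whole relational structure.

```python
import random


def sample_context(
    row: dict,
    children: dict[str, list[dict]],
    p_field: float,
    p_child: float,
    rng: random.Random,
) -> dict:
    """Pick a random part of one root row, and of its child rows, as context."""
    # Each field of the root row enters the context with probability p_field.
    ctx = {
        "fields": {k: v for k, v in row.items() if rng.random() < p_field},
        "children": {},
    }
    for table, rows in children.items():
        # With probability p_child, all rows of this child table enter the context,
        # and their fields are then filtered with p_field as well.
        if rng.random() < p_child:
            ctx["children"][table] = [
                {k: v for k, v in child.items() if rng.random() < p_field}
                for child in rows
            ]
    return ctx


player = {"pos": "G", "race": "W"}
seasons = {"season": [{"year": 2001, "points": 310}, {"year": 2002, "points": 415}]}

full = sample_context(player, seasons, p_field=1.0, p_child=1.0, rng=random.Random(0))
empty = sample_context(player, seasons, p_field=0.0, p_child=0.0, rng=random.Random(0))
```

The two extreme settings show the trade-off: with both probabilities at 1 the whole original sample is used as context, while with both at 0 nothing is, and the sample would be fully synthetic.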

The RelSemiSynth.prompt() method allows the user to pass the data that is used to sample the random context, and optionally fix the randomness with the rng parameter.

The RelSemiSynth.inverse_transform() method requires an iterable with the context in JSON form and the JSON objects generated by the LLM, and returns a RelationalData with the semisynthetic data.

Finally, the RelSemiSynth.generate() method returns a list with a single element with the generated semisynthetic data. The user can specify the randomness used in generating the prompts with the rng parameter.

import pandas as pd

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth.llm import RelSynthPreproc, RelSemiSynth, GenConfig

data: RelationalData = ...
preproc = RelSynthPreproc.from_data(data=data)
task = RelSemiSynth(preproc=preproc, p_field=0.3, p_child=0.3)
cfg_gen = GenConfig(
    model="your-favourite-model",
    prompt_template="Data schema: {schema}, context: {ctx}.\nSynthetic data:",
    # may depend on the model choice (available special tokens) and the task at hand
)
data_synth, = task.generate(
    cfg_gen=cfg_gen,
    ctx=data,  # extract the context randomly from the original data
)

Prediction

The RelPredict task refers to the prediction of a subset of the relational dataset columns (the target columns) from the remaining ones (the context columns). This is identical in spirit to the predictive mode available for neural models.

To define a RelPredict object, the user must specify the target columns with the cols_tgt parameter, as a dictionary providing the list of target columns for each concerned table.

To use the RelPredict.prompt() method, the user must provide the context containing the context columns, and optionally a set of generation directives, in the same fashion as those discussed for the RelSynth.prompt() method. The latter may only contain directives for the target columns, which are the ones being generated.

The RelPredict.inverse_transform() method requires the same context, and the output JSON generated by the LLM, as an iterable of n_pred predictions, each consisting of an iterable of JSON objects, one for each context sample. The output is a list of n_pred predictions in the form of RelationalData objects, including only the target columns.
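
The expected nesting (predictions first, context samples second) can be sketched with plain lists. The example assumes, purely for illustration, that the generations come back grouped by prompt, in which case a transposition gives the layout described above. The variable names and the values are made up.

```python
# Hypothetical LLM output grouped by prompt: 3 context samples, 2 generations each.
per_prompt = [
    [{"points": 10}, {"points": 12}],
    [{"points": 7}, {"points": 9}],
    [{"points": 21}, {"points": 18}],
]

# Transpose to n_pred predictions, each holding one JSON object per context sample.
per_prediction = [list(pred) for pred in zip(*per_prompt)]
n_pred = len(per_prediction)
```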

The RelPredict.generate() method also requires the context columns. The user can then specify the number of predictions n_pred and the generation directives for the target columns. The output is a list of n_pred RelationalData objects, with the generated target columns.

import pandas as pd

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth.llm import RelSynthPreproc, RelPredict, GenConfig

data: RelationalData = ...
preproc = RelSynthPreproc.from_data(data=data)
cols_tgt = {
    "season": ["points", "assists", "steals"],
    "all_star": ["points", "rebounds", "assists", "blocks"],
}
task = RelPredict(preproc=preproc, cols_tgt=cols_tgt)
cfg_gen = GenConfig(
    model="your-favourite-model",
    prompt_template="Context: {ctx}, output schema: {out_schema}.\nPrediction:",
    # may depend on the model choice (available special tokens) and the task at hand
)
ctx = {"players": pd.DataFrame(...), "season": pd.DataFrame(...), "all_star": pd.DataFrame(...)}
# all columns but the target ones must be specified in the context
pred = task.generate(
    cfg_gen=cfg_gen,
    ctx=ctx,
    n_pred=100,  # output is a list of 100 predictions
)

Event data tasks

Event data tasks use a RelEventPreproc.

Event data generation

The RelEvent task is similar to the RelSynth task, but specifically for event data. It allows the user either to generate fully synthetic event data, or to use a subset of the user table columns as context. The cols_ctx parameter has the same meaning as for the RelSynth task.

Similarly to the RelSynth.prompt() method, the RelEvent.prompt() method also takes an optional ctx parameter, an optional n_samples parameter, and the generation directives. Moreover, the user can optionally specify:

min_n_events: The minimum number of events generated for each sample.
max_n_events: The maximum number of events generated for each sample.
forbidden_events: A set of event types (specified with the corresponding event table name) that should not be generated.
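
The meaning of these three options can be made concrete with a small check on a generated sample. The respects_constraints function below is a hypothetical helper, and the representation of an event as a dictionary with a "table" key is an assumption made only for this sketch.

```python
from typing import Optional


def respects_constraints(
    events: list[dict],
    min_n_events: int = 0,
    max_n_events: Optional[int] = None,
    forbidden_events: frozenset = frozenset(),
) -> bool:
    """Check one generated sample against the event-count and event-type limits."""
    if len(events) < min_n_events:
        return False
    if max_n_events is not None and len(events) > max_n_events:
        return False
    # No event may come from a forbidden event table.
    return all(e["table"] not in forbidden_events for e in events)


sample = [
    {"table": "season", "year": 2001},
    {"table": "all_star", "season_id": 2001},
]
ok = respects_constraints(sample, min_n_events=1, max_n_events=10)
blocked = respects_constraints(sample, forbidden_events=frozenset({"all_star"}))
```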

The RelEvent.inverse_transform() method takes as input the optional context, the optional context in JSON form, and the generated output of the LLM, and returns a single RelationalData object with the event data in relational data form.

With the RelEvent.generate() method the user can build the prompts, generate with an LLM, and transform back the data. It is possible to specify the optional context, the optional number of samples, and the same other optional parameters of the RelEvent.prompt() method. The output is a list with a single RelationalData object containing the generated event data.

import pandas as pd

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth.llm import RelEventPreproc, RelEvent, GenConfig

data: RelationalData = ...
preproc = RelEventPreproc.from_data(
    data=data,
    ord_cols={"season": "year", "all_star": "season_id"},
)
task = RelEvent(preproc=preproc)
cfg_gen = GenConfig(
    model="your-favourite-model",
    prompt_template="Data schema: {schema}.\nSynthetic event data:",
    # may depend on the model choice (available special tokens) and the task at hand
)
data_synth, = task.generate(
    cfg_gen=cfg_gen,
    n_samples=data.n_samples[preproc.root],
)

Event data prediction

The RelEventPredict task allows the user to continue time series of event data and generate several possible futures. In this case the context must contain the full root table and the first events in the overall temporal series, which may belong to different event tables. The output will be the next events in the temporal series.

To instantiate the task, the user can optionally provide the number of events to include in the context during training (parameter n_events). If it is an integer, it is the number of context events for each example. If it is a float, it is the fraction of context events over the total for each example. If it is an iterable of int or float, each value is used for the corresponding example. If None, a random fraction of context events is sampled for each example.
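
The four accepted forms of n_events can be summarized with a short sketch that resolves them to a number of context events per example. The resolve_n_events function is hypothetical; rounding fractions down and capping the count at the number of available events are assumptions of this sketch, not documented behaviour.

```python
import random
from typing import Iterable, Optional, Union

NEvents = Optional[Union[int, float, Iterable[Union[int, float]]]]


def resolve_n_events(n_events: NEvents, totals: list[int], rng: random.Random) -> list[int]:
    """Return the number of context events for each example.

    totals[i] is the total number of events available for example i.
    """
    if n_events is None:
        # A random fraction of the events for each example.
        per_example = [rng.random() for _ in totals]
    elif isinstance(n_events, (int, float)):
        # The same count (int) or fraction (float) for every example.
        per_example = [n_events] * len(totals)
    else:
        # One value per example.
        per_example = list(n_events)

    counts = []
    for value, total in zip(per_example, totals):
        # Floats are fractions of the total (rounded down here), ints are counts.
        n = int(value * total) if isinstance(value, float) else value
        counts.append(min(n, total))  # assumption: never exceed the available events
    return counts


rng = random.Random(0)
fixed = resolve_n_events(3, [10, 2], rng)
fraction = resolve_n_events(0.5, [10, 4], rng)
mixed = resolve_n_events([1, 0.25], [10, 8], rng)
sampled = resolve_n_events(None, [10, 4], rng)
```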

The RelEventPredict.prompt() method requires the input context, with the root table and any number of events. The user can specify the same optional parameters as for the RelEvent.prompt() method, providing generation directives for all columns in the event tables, controlling the minimum and maximum number of events generated, and avoiding generating certain specific types of event.

The RelEventPredict.inverse_transform() method takes as input the context, optionally the same context in JSON format, and the generated future in JSON format. The latter is provided as an iterable of n_future iterables of JSON objects, each containing a possible future for each context sample. It returns a list of RelationalData containing the n_future generated time series. With the only_future optional parameter, the user can choose whether the output should contain only the future events, or all events (and the root table).

The RelEventPredict.generate() method also requires the user to provide the context. It also accepts all optional parameters of the RelEventPredict.prompt() method, as well as the only_future optional parameter of the RelEventPredict.inverse_transform() method. Additionally, the user can set the number of generated future series, via the n_future parameter. Its output is a list of the n_future generated time series, as RelationalData objects.

import pandas as pd

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth.llm import RelEventPreproc, RelEventPredict, GenConfig

data: RelationalData = ...
preproc = RelEventPreproc.from_data(
    data=data,
    ord_cols={"season": "year", "all_star": "season_id"},
)
task = RelEventPredict(preproc=preproc)
cfg_gen = GenConfig(
    model="your-favourite-model",
    prompt_template="Context: {ctx}, output schema: {out_schema}.\nFuture events:",
    # may depend on the model choice (available special tokens) and the task at hand
)
ctx = {  # the context must contain the root table (players) and all past events
    "players": pd.DataFrame(...),
    "season": pd.DataFrame(...),
    "all_star": pd.DataFrame(...),
}
pred = task.generate(
    cfg_gen=cfg_gen,
    ctx=ctx,
    n_future=100,  # output is a list of 100 predictions
    min_n_events=1,  # generate at least one event per sample
    max_n_events=10,  # generate at most ten events per sample
    forbidden_events=["all_star"],  # generate only events of type `season`
)