Tasks
The aindo.rdml.synth.llm module allows the user to tackle several tasks.
The list of available tasks can be found in the Task enumeration.
For each Task element, there is a BaseTask object
available, which provides the necessary pre- and post-processing utilities.
All BaseTask objects are built starting from one of the preprocessors
described in the preprocessing section.
Optionally, the user can provide a description, which can be used to build the prompts for training or inference.
Each BaseTask object provides the following methods:
- get_dataset(): Given a RelationalData as input, it returns an iterator of DatasetElem, one for each data sample. A DatasetElem contains the data to build a training example to fine-tune an LLM, including the JSON schema of the full relational dataset, the context data in JSON format, the target data in JSON format, and the JSON schema of the target data (a usage sketch follows this list).
- prompt(): This method returns an iterator of RelGenPrompt. Each RelGenPrompt contains the data to build a prompt for a single dataset sample. It includes the JSON schema of the full relational dataset, the context data in JSON format, the JSON schema of the target data, and a pydantic.BaseModel that can be used to validate the output data obtained from the LLM.
- inverse_transform(): This routine can be used to transform the output of the LLM back into the original relational data format.
- generate(): This method essentially combines the prompt() method, the LLM generation with a Generator and a provided Engine (see the section about LLM generation for further details), and the inverse_transform() method. For all tasks, the generate() method takes as input:
  - cfg_gen: The generation configuration, an instance of GenConfig, which is used to set up the Generator and perform the generation with the Generator.generate() method. All the parameters of GenConfig have the same meaning as the parameters used in the Generator and detailed in the generation section.
  - ctx: The context data; it may have different uses according to the task at hand.
  - output_dir: Optional output directory for generation logs.
  - engine_args: Optional CLI arguments to be passed to the generation engine.
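As a concrete illustration, the following minimal sketch shows how get_dataset() could be used to assemble prompt/completion pairs for fine-tuning. It assumes that get_dataset() accepts the RelationalData positionally, and the DatasetElem attribute names (schema, ctx, tgt) are hypothetical placeholders for the fields described above, not the actual API.
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth.llm import RelSynth, RelSynthPreproc

data: RelationalData = ...
task = RelSynth(preproc=RelSynthPreproc.from_data(data=data))

# Build one fine-tuning example per dataset sample.
# NOTE: `elem.schema`, `elem.ctx` and `elem.tgt` are hypothetical attribute names,
# standing in for the schema, context and target data carried by each DatasetElem.
examples = []
for elem in task.get_dataset(data):
    examples.append({
        "prompt": f"Data schema: {elem.schema}, context: {elem.ctx}.\nSynthetic data:",
        "completion": elem.tgt,
    })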
Synthetic data tasks
Synthetic data tasks are built from a RelSynthPreproc.
Synthetic data generation
The RelSynth task is used to generate synthetic data, either from scratch or
with a subset of the root table columns as context.
Beyond the preprocessor and the optional description, the user can optionally specify the
context columns by passing the list of column names to the cols_ctx parameter.
The RelSynth.prompt() method takes the following extra arguments:
- n_samples: When generating fully synthetic data, and no context is needed, it can be used to specify the number of samples (rows of the root table) to generate.
- directives: A GenDirective dictionary with the generation directives for each table. The latter are provided as TableDirective objects, which in turn are dictionaries of ColumnDirective objects, containing the generation options for each column (see the sketch after this list):
  - impute_na: Whether to impute NA values for that column.
  - impute_protected: Whether to impute the protected values, if the protection was active for that column. If False, the synthetic data may contain masked values in place of the rare ones.
  - mask: Some extra values to avoid sampling when generating synthetic data. They can be among the categories of a categorical column, or among the declared special_values for the other column types.
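As an illustration of the directives structure, the sketch below builds a GenDirective for a hypothetical players table, using each of the three ColumnDirective options described above; the column names and masked values are placeholders only.
from aindo.rdml.synth.llm import ColumnDirective, GenDirective, TableDirective

directives = GenDirective({
    "players": TableDirective({
        "race": ColumnDirective(impute_na=True),  # do not leave NA values in this column
        "name": ColumnDirective(impute_protected=True),  # impute values masked by the privacy protection
        "pos": ColumnDirective(mask=[...]),  # extra categories/special values to avoid sampling
    })
})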
The RelSynth.inverse_transform() method takes as input
an optional iterable with the context in JSON format (for example, as available in
RelGenPrompt.ctx), an iterable with the JSON data generated by the LLM,
and optionally the original context data.
It returns an instance of RelationalData with the generated synthetic data.
Finally, the RelSynth.generate() method will provide a list with a single
element corresponding to the generated synthetic data.
Beyond the parameters discussed previously, the user may also provide n_samples and directives, which are passed
to the RelSynth.prompt() method.
As an example, we show how to initialize a RelSynth task
and how to use it to generate synthetic data, using a single root table column as context.
Moreover, for illustration purposes, we also set up the generation to avoid sampling NA values
in the race column of the players table.
import pandas as pd
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth.llm import (
RelSynthPreproc,
RelSynth,
GenConfig,
GenDirective,
TableDirective,
ColumnDirective,
)
data: RelationalData = ...
preproc = RelSynthPreproc.from_data(data=data)
task = RelSynth(preproc=preproc, cols_ctx=["pos"])
cfg_gen = GenConfig(
model="your-favourite-model", # optionally fine-tuned on the synthetic data generation task
prompt_template="Data schema: {schema}, context: {ctx}.\nSynthetic data:",
# may depend on the model choice (available special tokens) and the task at hand
)
data_synth, = task.generate(
cfg_gen=cfg_gen,
ctx={"players": pd.DataFrame({"pos": pd.Series(...)})}, # use `n_samples` if there is no context
directives=GenDirective({
"players": TableDirective({
"race": ColumnDirective(impute_na=True)
})
}),
)
The same result can be obtained using the RelSynth.prompt() and the
RelSynth.inverse_transform() methods,
by explicitly taking care of the generation process.
The latter part can be done with the tools provided by aindo.rdml.synth.llm,
which are described in the section about generation.
import pandas as pd
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth.llm import (
RelSynthPreproc,
RelSynth,
Generator,
Engine,
GenDirective,
TableDirective,
ColumnDirective,
)
data: RelationalData = ...
preproc = RelSynthPreproc.from_data(data=data)
task = RelSynth(preproc=preproc, cols_ctx=["pos"])
# Get the data needed to build the prompts.
ctx = {"players": pd.DataFrame({"pos": pd.Series(...)})}
prompts = list(task.prompt(  # materialize the iterator, since the prompts are reused below
    ctx=ctx,
    directives=GenDirective({
        "players": TableDirective({
            "race": ColumnDirective(impute_na=True)
        })
    }),
))
# Generate the data, using an LLM. For example, you can use a `Generator` object.
generator = Generator.from_engine(
engine=Engine.VLLM,
model="your-favourite-model",
)
data_valid, _ = generator.generate(
prompt_template="Data schema: {schema}, context: {ctx}.\nSynthetic data:",
prompts=prompts,
n=1, # one generation per prompt
)
# Inverse transform
x = [p.ctx for p, d in zip(prompts, data_valid, strict=True) if d] # get the ctx used in the prompts
data_valid = [d[0] for d in data_valid if d] # there is one generated output per prompt
data_synth = task.inverse_transform(
x=x,
y=data_valid,
ctx=ctx,
progress=len(data_valid),
)
Semisynthetic data generation
With the RelSemiSynth task it is possible to generate synthetic data by injecting
a random portion of the real data as context for the generation.
The more of the original data is used as context, the higher the similarity between the original and the synthetic data. However, injecting too much of the original data may degrade the privacy protection guarantees. The user can obtain the desired similarity-privacy balance by tuning the amount of real data used as context.
When instantiating a RelSemiSynth object, it is necessary to specify the
following parameters:
- p_field: A float between 0 and 1, indicating the probability that, for each row, any given column field is included in the context.
- p_child: A float between 0 and 1, indicating the probability that, for each row, all the rows of any given child table with a foreign key referring to that row are included in the context. If they are, their fields are then selected for the context using p_field. Otherwise, all the child rows are generated, together with their own child rows, and so on. In other words, the whole relational structure pertaining to that original row will be fully synthetic.
Optionally, the user can also specify (see the sketch after this list):
- ctx_as_const: Whether to insert the selected context in the target JSON schema, as constant values.
- rng_train: A numpy.random.Generator or an integer seed to fix the randomness of the context selection during training.
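For instance, a RelSemiSynth task with both the required and the optional parameters could be instantiated as in the following sketch; the parameter values are illustrative only.
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth.llm import RelSemiSynth, RelSynthPreproc

data: RelationalData = ...
preproc = RelSynthPreproc.from_data(data=data)
task = RelSemiSynth(
    preproc=preproc,
    p_field=0.3,  # each column field has a 30% probability of entering the context
    p_child=0.3,  # each set of child rows has a 30% probability of entering the context
    ctx_as_const=True,  # insert the selected context in the target JSON schema as constant values
    rng_train=42,  # integer seed fixing the context selection during training
)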
The RelSemiSynth.prompt() method allows the user to pass the data
that is used to sample the random context, and optionally fix the randomness with the rng parameter.
The RelSemiSynth.inverse_transform() method
requires an iterable with the context in JSON form and the JSON objects generated by the LLM,
and returns a RelationalData with the semisynthetic data.
Finally, the RelSemiSynth.generate() method returns a list with a
single element containing the generated semisynthetic data.
The user can specify the randomness used in generating the prompts with the rng parameter.
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth.llm import RelSynthPreproc, RelSemiSynth, GenConfig
data: RelationalData = ...
preproc = RelSynthPreproc.from_data(data=data)
task = RelSemiSynth(preproc=preproc, p_field=0.3, p_child=0.3)
cfg_gen = GenConfig(
model="your-favourite-model",
prompt_template="Data schema: {schema}, context: {ctx}.\nSynthetic data:",
# may depend on the model choice (available special tokens) and the task at hand
)
data_synth, = task.generate(
cfg_gen=cfg_gen,
ctx=data, # extract the context randomly from the original data
)
Prediction
The RelPredict task refers to the prediction of a subset of the relational
dataset columns (the target columns) from the remaining ones (the context columns).
This is identical in spirit to the predictive mode available for neural models.
To define a RelPredict object, the user must specify the target columns with the
cols_tgt parameter, as a dictionary providing the list of target columns for each concerned table.
To use the RelPredict.prompt() method, the user must provide
the context containing the context columns, and optionally a set of generation directives,
in the same fashion as those discussed for the RelSynth.prompt() method.
The directives may only contain entries for the target columns, which are the ones being generated.
The RelPredict.inverse_transform() method requires
the same context, and the output JSON generated by the LLM, as an iterable of n_pred predictions,
each consisting of an iterable of JSON objects, one for each context sample.
The output is a list of n_pred predictions in the form of RelationalData
objects, including only the target columns.
The RelPredict.generate() method also requires the context columns.
The user can then specify the number of predictions n_pred and the generation directives for the target columns.
The output is a list of n_pred RelationalData
objects, with the generated target columns.
import pandas as pd
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth.llm import RelSynthPreproc, RelPredict, GenConfig
data: RelationalData = ...
preproc = RelSynthPreproc.from_data(data=data)
cols_tgt = {
"season": ["points", "assists", "steals"],
"all_star": ["points", "rebounds", "assists", "blocks"],
}
task = RelPredict(preproc=preproc, cols_tgt=cols_tgt)
cfg_gen = GenConfig(
model="your-favourite-model",
prompt_template="Context: {ctx}, output schema: {out_schema}.\nPrediction:",
# may depend on the model choice (available special tokens) and the task at hand
)
ctx = {"players": pd.DataFrame(...), "season": pd.DataFrame(...), "all_star": pd.DataFrame(...)},
# all columns but the target ones must be specified in teh context
pred = task.generate(
cfg_gen=cfg_gen,
ctx=ctx,
n_pred=100, # output is a list of 100 predictions
)
Event data tasks
Event data tasks use a RelEventPreproc.
Event data generation
The RelEvent task is similar to the RelSynth
task, but specifically for event data.
It allows the user to either generate fully synthetic event data, or to use a subset of the user table columns as context.
The cols_ctx parameter has the same meaning as for the RelSynth task.
Similarly to the RelSynth.prompt() method,
the RelEvent.prompt() method also takes an optional ctx parameter,
an optional n_samples parameter, and the generation directives.
Moreover, the user can optionally specify:
- min_n_events: The minimum number of events generated for each sample.
- max_n_events: The maximum number of events generated for each sample.
- forbidden_events: A set of event types (specified with the corresponding event table name) that should not be generated.
The RelEvent.inverse_transform() method takes as input the
optional context, the optional context in JSON form, and the generated output of the LLM,
and returns a single RelationalData object with the event data
in relational data form.
With the RelEvent.generate() method the user can build the prompts,
generate with an LLM, and transform back the data.
It is possible to specify the optional context, the optional number of samples, and the same other optional
parameters of the RelEvent.prompt() method.
The output is a list with a single RelationalData object containing the
generated event data.
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth.llm import RelEventPreproc, RelEvent, GenConfig
data: RelationalData = ...
preproc = RelEventPreproc.from_data(
data=data,
ord_cols={"season": "year", "all_star": "season_id"},
)
task = RelEvent(preproc=preproc)
cfg_gen = GenConfig(
model="your-favourite-model",
prompt_template="Data schema: {schema}.\nSynthetic event data:",
# may depend on the model choice (available special tokens) and the task at hand
)
data_synth, = task.generate(
cfg_gen=cfg_gen,
n_samples=data.n_samples[preproc.root],
)
Event data prediction
The RelEventPredict task allows the user to continue time series
of event data and generate several possible futures.
In this case the context must contain the full root table and the first events in the overall temporal series,
which may belong to different event tables.
The output will be the next events in the temporal series.
To instantiate the task, the user can optionally provide the number of events to include in the context
during training (parameter n_events).
If an integer value, it is the number of context events for each example.
If a float, it is the fraction of context events out of the total for each example.
If an iterable of int or float, each value is used for the corresponding example.
If None, a random fraction of context events is sampled for each example.
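For instance, the following sketch shows two possible ways of setting n_events; the values are illustrative only.
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth.llm import RelEventPredict, RelEventPreproc

data: RelationalData = ...
preproc = RelEventPreproc.from_data(
    data=data,
    ord_cols={"season": "year", "all_star": "season_id"},
)
# Use half of each example's events as context during training ...
task = RelEventPredict(preproc=preproc, n_events=0.5)
# ... or exactly three context events for every example.
task = RelEventPredict(preproc=preproc, n_events=3)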
The RelEventPredict.prompt() method requires the input context,
with the root table and any number of events.
The user can specify the same optional parameters as for the RelEvent.prompt()
method, providing generation directives for all columns in the event tables, controlling the minimum and maximum
number of events generated, and avoiding the generation of specific event types.
The RelEventPredict.inverse_transform() method
takes as input the context, optionally the same context in JSON format, and the generated future in JSON format.
The latter is provided as an iterable of n_future iterables of JSON objects, each containing a possible future
for each context sample.
It returns a list of RelationalData containing the n_future generated
time series.
With the only_future optional parameter, the user can choose whether the output should contain only the future
events, or all events (and the root table).
The RelEventPredict.generate() method also requires the user
to provide the context.
It also accepts all optional parameters of the RelEventPredict.prompt()
method, as well as the only_future optional parameter of the
RelEventPredict.inverse_transform() method.
Additionally, the user can set the number of generated future series, via the n_future parameter.
Its output is a list of the n_future generated time series, as
RelationalData objects.
import pandas as pd
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth.llm import RelEventPreproc, RelEventPredict, GenConfig
data: RelationalData = ...
preproc = RelEventPreproc.from_data(
data=data,
ord_cols={"season": "year", "all_star": "season_id"},
)
task = RelEventPredict(preproc=preproc)
cfg_gen = GenConfig(
model="your-favourite-model",
prompt_template="Context: {ctx}, output schema: {out_schema}.\nFuture events:",
# may depend on the model choice (available special tokens) and the task at hand
)
ctx = { # the context must contain the root table (players) and all past events
"players": pd.DataFrame(...),
"season": pd.DataFrame(...),
"all_star": pd.DataFrame(...),
}
pred = task.generate(
cfg_gen=cfg_gen,
ctx=ctx,
n_future=100, # output is a list of 100 predictions
min_n_events=1, # generate at least one event per sample
max_n_events=10, # generate at most ten events per sample
forbidden_events=["all_star"], # generate only events of type `season`
)