Tasks

DatasetElem dataclass

DatasetElem(
    ctx: str,
    out: str,
    schema: str,
    out_schema: str,
    description: str,
    task: str,
)

A dataclass representing the data included in a single dataset example.

Attributes:

Name Type Description
ctx str

A JSON string representing the context data.

out str

A JSON string representing the target.

schema str

The JSON schema corresponding to the full data.

out_schema str

The JSON schema of the output data.

description str

A description.

task str

The name of the task.
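
As a rough illustration of how these fields relate, the sketch below builds one example with a local stand-in dataclass that mirrors the documented fields. It is not imported from the library, and the record contents, schemas, and task string are invented for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class DatasetElemSketch:
    """Local stand-in mirroring the documented DatasetElem fields."""

    ctx: str
    out: str
    schema: str
    out_schema: str
    description: str
    task: str


# Hypothetical record: "age" is given as context, "city" is the target.
full_schema = {
    "type": "object",
    "properties": {"age": {"type": "integer"}, "city": {"type": "string"}},
}
out_schema = {"type": "object", "properties": {"city": {"type": "string"}}}

elem = DatasetElemSketch(
    ctx=json.dumps({"age": 42}),
    out=json.dumps({"city": "Turin"}),
    schema=json.dumps(full_schema),
    out_schema=json.dumps(out_schema),
    description="Customer records (illustrative).",
    task="predict",  # assumed task name; the real values come from Task
)

# Every JSON-carrying field is a string that parses back to a dict.
parsed_ctx = json.loads(elem.ctx)
parsed_out = json.loads(elem.out)
```

The point of the sketch is only that ctx, out, schema, and out_schema are JSON strings rather than parsed objects.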

RelGenPrompt dataclass

RelGenPrompt(
    schema: str,
    ctx: str,
    out_schema: str,
    description: str,
    task: str,
    out_model: type[BaseModel],
)

A dataclass with the data used to build a generation prompt.

Attributes:

Name Type Description
schema str

The JSON schema corresponding to the full data.

ctx str

A JSON string representing the context data.

out_schema str

The JSON schema of the output data.

out_model type[BaseModel]

The type of the pydantic model used to validate the generated data.

description str

A description.

task str

The name of the task.

ColumnDirective pydantic-config

ColumnDirective(
    impute_na: bool = False,
    impute_protected: bool = False,
    mask: Sequence = (),
)

A configuration containing the generation directives for a single column.

Parameters:

Name Type Description Default
impute_na bool

Whether to impute NA values.

False
impute_protected bool

Whether to impute the mask value used to protect data.

False
mask Sequence

Sequence of values to exclude from generation.

()

TableDirective module-attribute

TableDirective = dict[str, ColumnDirective]

A dictionary with the generation directive for a table's columns.

GenDirective module-attribute

GenDirective = dict[str, TableDirective]

A dictionary with the generation directives for the dataset tables.
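
The nesting of the two aliases above (table name, then column name, then directive) can be sketched as follows. ColumnDirectiveSketch is a local stand-in that mirrors the documented fields, the table and column names are hypothetical, and the fallback to default directives in lookup is an assumption rather than documented behavior.

```python
from collections.abc import Sequence
from dataclasses import dataclass


@dataclass
class ColumnDirectiveSketch:
    """Local stand-in mirroring the documented ColumnDirective fields."""

    impute_na: bool = False
    impute_protected: bool = False
    mask: Sequence = ()


# Table name -> column name -> directive (all names are hypothetical).
directives = {
    "customers": {
        "age": ColumnDirectiveSketch(impute_na=True),
        "city": ColumnDirectiveSketch(mask=("Unknown", "N/A")),
    },
    "orders": {
        "amount": ColumnDirectiveSketch(impute_protected=True),
    },
}


def lookup(table: str, column: str) -> ColumnDirectiveSketch:
    """Return the directive for a column.

    Assumption: columns without an entry behave as if they had the defaults.
    """
    return directives.get(table, {}).get(column, ColumnDirectiveSketch())
```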

GenConfig pydantic-config

GenConfig(
    model: str,
    prompt_template: str | Callable[[GenPrompt], str],
    engine: Engine | str = VLLM,
    model_dir: Path | str | None = None,
    unsloth: bool = False,
    guided: bool = True,
    retry_on_fail: NonNegativeInt = 100,
    engine_kwargs: dict[str, Any] = Field(
        default_factory=dict
    ),
    generate_kwargs: dict[str, Any] = Field(
        default_factory=dict
    ),
)

Generation configuration.

Parameters:

Name Type Description Default
model str

Model name or path.

required
prompt_template str | Callable[[GenPrompt], str]

The template for the prompt. A string template may use the fields of GenPrompt as keys.

required
engine Engine | str

The generation backend engine.

VLLM
model_dir Path | str | None

The directory where the converted model is saved.

None
unsloth bool

Whether to use unsloth for merging the model.

False
guided bool

Whether to use guided generation.

True
retry_on_fail NonNegativeInt

How many times to retry generating valid output before giving up.

100
engine_kwargs dict[str, Any]

Keyword arguments used when building the generator engine.

Field(default_factory=dict)
generate_kwargs dict[str, Any]

Keyword arguments for the engine sampling parameters.

Field(default_factory=dict)
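
Since prompt_template accepts either a string or a callable, the two forms might be handled along these lines. This is a minimal sketch: PromptDataSketch is a local stand-in with a few of the documented prompt fields, render is a hypothetical helper, and filling string templates with str.format is an assumption about how the keys are substituted.

```python
from __future__ import annotations

from collections.abc import Callable
from dataclasses import asdict, dataclass


@dataclass
class PromptDataSketch:
    """Local stand-in with a few of the documented prompt fields."""

    schema: str
    ctx: str
    description: str
    task: str


def render(
    template: str | Callable[[PromptDataSketch], str],
    data: PromptDataSketch,
) -> str:
    """Render a prompt from either a format string or a callable."""
    if callable(template):
        return template(data)
    # Assumption: string templates are filled with str.format, one key per field.
    return template.format(**asdict(data))


data = PromptDataSketch(
    schema='{"type": "object"}',
    ctx='{"age": 42}',
    description="Customer records (illustrative).",
    task="synth",
)

from_string = render("Task: {task} | Context: {ctx}", data)
from_callable = render(lambda p: f"[{p.task}] {p.description}", data)
```

A callable is the more flexible option when the prompt needs conditional sections, for example omitting the context line when ctx is empty.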

Task

An enumeration of the available tasks.

Attributes:

Name Type Description
SYNTH

Full synthetic data generation.

SEMI_SYNTH

Generation of synthetic data with a random component used as context.

PREDICT

Prediction of a subset of the data from the remaining part (context).

EVENT

Full generation of synthetic event data.

EVENT_PREDICT

Continuation of event data time series.

Synthetic data tasks

RelSynth pydantic-config

RelSynth(
    description: str = '',
    preproc: RelSynthPreproc,
    cols_ctx: list[str] = [],
)

Synthetic data generation task.

Parameters:

Name Type Description Default
preproc RelSynthPreproc

A preprocessor.

required
cols_ctx list[str]

The columns of the root table to use as context.

[]

get_dataset

get_dataset(data: RelationalData) -> Iterator[DatasetElem]

Get the data to build dataset examples from some input data.

Parameters:

Name Type Description Default
data RelationalData

The input data.

required

Returns:

Type Description
Iterator[DatasetElem]

An Iterator of examples in the form of DatasetElem.

prompt

prompt(
    ctx: Data | None = None,
    n_samples: int | None = None,
    directives: GenDirective | None = None,
) -> Iterable[RelGenPrompt]

Get the data to build generation prompts.

Parameters:

Name Type Description Default
ctx Data | None

The context root columns, if needed.

None
n_samples int | None

The number of samples to generate. Only one of n_samples and ctx should be provided.

None
directives GenDirective | None

The generation directives for the dataset columns.

None

Returns:

Type Description
Iterable[RelGenPrompt]

An iterable of prompt data in the form of RelGenPrompt.

inverse_transform

inverse_transform(
    x: Iterable[str | dict[str, Any]] | None,
    y: Iterable[str | dict[str, Any]],
    ctx: Data | None,
    progress: int | None = None,
) -> RelationalData

Get a RelationalData object from the output generated by the LLM.

Parameters:

Name Type Description Default
x Iterable[str | dict[str, Any]] | None

An iterable with the context (as available in RelGenPrompt.ctx), if any.

required
y Iterable[str | dict[str, Any]]

An iterable with the output generated by the LLM.

required
ctx Data | None

The context, if any.

required
progress int | None

Whether to show the progress bar.

None

Returns:

Type Description
RelationalData

The RelationalData.

generate

generate(
    cfg_gen: GenConfig,
    ctx: Data | None = None,
    output_dir: Path | str | None = None,
    engine_args: Sequence[str] = (),
    n_samples: int | None = None,
    directives: GenDirective | None = None,
) -> list[RelationalData]

Generate synthetic data.

Parameters:

Name Type Description Default
cfg_gen GenConfig

The generation configuration.

required
ctx Data | None

The context.

None
output_dir Path | str | None

The output directory for generation logs.

None
engine_args Sequence[str]

CLI arguments for the generation engine.

()
n_samples int | None

The number of samples to generate. Only one of n_samples and ctx should be provided.

None
directives GenDirective | None

The generation directives for the dataset columns.

None

Returns:

Type Description
list[RelationalData]

A list of RelationalData objects, with a single element containing the generated synthetic data.

RelSemiSynth pydantic-config

RelSemiSynth(
    description: str = '',
    preproc: RelSynthPreproc,
    p_field: Annotated[float, Field(ge=0, le=1)],
    p_child: Annotated[float, Field(ge=0, le=1)],
    ctx_as_const: bool = False,
    rng_train: NpRng = None,
)

Semisynthetic data generation task.

Parameters:

Name Type Description Default
preproc RelSynthPreproc

A preprocessor.

required
p_field Annotated[float, Field(ge=0, le=1)]

The probability that a column field is selected in the context.

required
p_child Annotated[float, Field(ge=0, le=1)]

The probability that a child field is selected in the context.

required
ctx_as_const bool

Whether to leave the context fields as constants in the output JSON schema.

False
rng_train NpRng

A numpy random number generator used to split the training data.

None
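
To make the role of p_field concrete, the sketch below keeps each field of a flat record in the context independently with probability p_field. Independent per-field selection is an assumption, sample_context is a hypothetical helper, and the standard-library random.Random is used here as a stand-in for the numpy random number generator the task accepts.

```python
import random


def sample_context(record: dict, p_field: float, rng: random.Random) -> dict:
    """Keep each field of a record independently with probability p_field."""
    if not 0.0 <= p_field <= 1.0:
        raise ValueError("p_field must be in [0, 1]")
    # rng.random() is in [0, 1), so p_field=0 keeps nothing and p_field=1 keeps all.
    return {key: value for key, value in record.items() if rng.random() < p_field}


record = {"age": 42, "city": "Turin", "segment": "retail"}

nothing = sample_context(record, 0.0, random.Random(0))
everything = sample_context(record, 1.0, random.Random(0))
partial = sample_context(record, 0.5, random.Random(0))
```

Fields that are not selected would form the part of the record left for the model to generate.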

get_dataset

get_dataset(data: RelationalData) -> Iterator[DatasetElem]

Get the data to build dataset examples from some input data.

Parameters:

Name Type Description Default
data RelationalData

The input data.

required

Returns:

Type Description
Iterator[DatasetElem]

An Iterator of examples in the form of DatasetElem.

prompt

prompt(
    ctx: Data, rng: NpRng = None
) -> Iterator[RelGenPrompt]

Get the data to build generation prompts.

Parameters:

Name Type Description Default
ctx Data

The data to sample the context from.

required
rng NpRng

A numpy random number generator used to sample the context from the provided data.

None

Returns:

Type Description
Iterator[RelGenPrompt]

An iterable of prompt data in the form of RelGenPrompt.

inverse_transform

inverse_transform(
    ctx: Iterable[str | dict[str, Any]],
    y: Iterable[str | dict[str, Any]],
    progress: int | None = None,
) -> RelationalData

Get a RelationalData object from the output generated by the LLM.

Parameters:

Name Type Description Default
ctx Iterable[str | dict[str, Any]]

An iterable with the context (as available in RelGenPrompt.ctx).

required
y Iterable[str | dict[str, Any]]

An iterable with the output generated by the LLM.

required
progress int | None

Whether to show the progress bar.

None

Returns:

Type Description
RelationalData

The RelationalData.

generate

generate(
    cfg_gen: GenConfig,
    ctx: Data | None = None,
    output_dir: Path | str | None = None,
    engine_args: Sequence[str] = (),
    rng: NpRng = None,
) -> list[RelationalData]

Generate semisynthetic data.

Parameters:

Name Type Description Default
cfg_gen GenConfig

The generation configuration.

required
ctx Data | None

The context.

None
output_dir Path | str | None

The output directory for generation logs.

None
engine_args Sequence[str]

CLI arguments for the generation engine.

()
rng NpRng

A numpy random number generator used to sample the context from the provided data.

None

Returns:

Type Description
list[RelationalData]

A list of RelationalData objects, with a single element containing the generated semisynthetic data.

RelPredict pydantic-config

RelPredict(
    description: str = '',
    preproc: RelSynthPreproc,
    cols_tgt: dict[str, list[str]],
)

Prediction task.

Parameters:

Name Type Description Default
preproc RelSynthPreproc

A preprocessor.

required
cols_tgt dict[str, list[str]]

A dictionary with the target columns to be predicted for each table. A table missing from the dictionary is treated as entirely part of the context.

required
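
The effect of cols_tgt on the context/target split can be sketched with plain dictionaries. split_columns is a hypothetical helper and the table and column names are invented. The only behavior taken from the description above is that a table absent from cols_tgt contributes all of its columns to the context.

```python
def split_columns(
    tables: dict[str, list[str]],
    cols_tgt: dict[str, list[str]],
) -> tuple[dict[str, list[str]], dict[str, list[str]]]:
    """Split each table's columns into context columns and target columns."""
    ctx: dict[str, list[str]] = {}
    tgt: dict[str, list[str]] = {}
    for table, columns in tables.items():
        # A table missing from cols_tgt has no targets: it is all context.
        targets = set(cols_tgt.get(table, []))
        ctx[table] = [c for c in columns if c not in targets]
        tgt[table] = [c for c in columns if c in targets]
    return ctx, tgt


tables = {
    "customers": ["age", "city", "churned"],
    "orders": ["amount", "date"],
}
ctx_cols, tgt_cols = split_columns(tables, {"customers": ["churned"]})
```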

get_dataset

get_dataset(data: RelationalData) -> Iterator[DatasetElem]

Get the data to build dataset examples from some input data.

Parameters:

Name Type Description Default
data RelationalData

The input data.

required

Returns:

Type Description
Iterator[DatasetElem]

An Iterator of examples in the form of DatasetElem.

prompt

prompt(
    ctx: Data, directives: GenDirective | None = None
) -> Iterable[RelGenPrompt]

Get the data to build generation prompts.

Parameters:

Name Type Description Default
ctx Data

The data with the context columns.

required
directives GenDirective | None

The generation directives for the target columns.

None

Returns:

Type Description
Iterable[RelGenPrompt]

An iterable of prompt data in the form of RelGenPrompt.

inverse_transform

inverse_transform(
    ctx: Data,
    pred: Iterable[Iterable[str | dict[str, Any]]],
    progress: int | None = None,
) -> list[RelationalData]

Get the predictions in the form of RelationalData objects from the output generated by the LLM.

Parameters:

Name Type Description Default
ctx Data

The context data.

required
pred Iterable[Iterable[str | dict[str, Any]]]

An iterable with the predictions generated by the LLM.

required
progress int | None

Whether to show the progress bar.

None

Returns:

Type Description
list[RelationalData]

A list of RelationalData objects containing the predicted target columns.

generate

generate(
    cfg_gen: GenConfig,
    ctx: Data | None = None,
    output_dir: Path | str | None = None,
    engine_args: Sequence[str] = (),
    n_pred: int = 1,
    directives: GenDirective | None = None,
) -> list[RelationalData]

Generate several predictions.

Parameters:

Name Type Description Default
cfg_gen GenConfig

The generation configuration.

required
ctx Data | None

The context.

None
output_dir Path | str | None

The output directory for generation logs.

None
engine_args Sequence[str]

CLI arguments for the generation engine.

()
n_pred int

The number of predictions for each prompt.

1
directives GenDirective | None

The generation directives for the target columns.

None

Returns:

Type Description
list[RelationalData]

A list of RelationalData objects, with the n_pred generated predictions.

Event data tasks

RelEvent pydantic-config

RelEvent(
    description: str = '',
    preproc: RelEventPreproc,
    cols_ctx: list[str] = [],
)

Synthetic event data generation task.

Parameters:

Name Type Description Default
preproc RelEventPreproc

A preprocessor.

required
cols_ctx list[str]

The columns of the root table to use as context.

[]

get_dataset

get_dataset(data: RelationalData) -> Iterator[DatasetElem]

Get the data to build dataset examples from some input data.

Parameters:

Name Type Description Default
data RelationalData

The input data.

required

Returns:

Type Description
Iterator[DatasetElem]

An Iterator of examples in the form of DatasetElem.

prompt

prompt(
    ctx: Data | None = None,
    n_samples: int | None = None,
    min_n_events: int | None = None,
    max_n_events: int | None = None,
    forbidden_events: Collection[str] = (),
    directives: GenDirective | None = None,
) -> Iterator[RelGenPrompt]

Get the data to build generation prompts.

Parameters:

Name Type Description Default
ctx Data | None

The context root columns, if needed.

None
n_samples int | None

The number of samples to generate. Only one of n_samples and ctx should be provided.

None
min_n_events int | None

The optional minimum number of generated events per prompt.

None
max_n_events int | None

The optional maximum number of generated events per prompt.

None
forbidden_events Collection[str]

An optional collection of events that should not be generated.

()
directives GenDirective | None

The generation directives for the dataset columns.

None

Returns:

Type Description
Iterator[RelGenPrompt]

An iterable of prompt data in the form of RelGenPrompt.
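
The three event constraints (min_n_events, max_n_events, forbidden_events) can be read as a predicate over a generated event sequence. The checker below is a hypothetical helper, not part of the library. It assumes events are identified by name, that both bounds are inclusive, and that None means no bound.

```python
from __future__ import annotations

from collections.abc import Collection, Sequence


def events_allowed(
    events: Sequence[str],
    min_n_events: int | None = None,
    max_n_events: int | None = None,
    forbidden_events: Collection[str] = (),
) -> bool:
    """Check a sequence of event names against the optional constraints."""
    if min_n_events is not None and len(events) < min_n_events:
        return False
    if max_n_events is not None and len(events) > max_n_events:
        return False
    # Reject the sequence if any event is in the forbidden collection.
    return not any(event in forbidden_events for event in events)


generated = ["login", "purchase", "logout"]
```

For instance, events_allowed(generated, max_n_events=2) is False because the sequence holds three events.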

inverse_transform

inverse_transform(
    ctx: Data | None,
    x: Iterable[str | dict[str, Any]] | None,
    y: Iterable[str | dict[str, Any]],
    progress: int | None = None,
) -> RelationalData

Get a RelationalData object from the output generated by the LLM.

Parameters:

Name Type Description Default
ctx Data | None

The context, if any.

required
x Iterable[str | dict[str, Any]] | None

An iterable with the context (as available in RelGenPrompt.ctx), if any.

required
y Iterable[str | dict[str, Any]]

An iterable with the output generated by the LLM.

required
progress int | None

Whether to show the progress bar.

None

Returns:

Type Description
RelationalData

The RelationalData.

generate

generate(
    cfg_gen: GenConfig,
    ctx: Data | None = None,
    output_dir: Path | str | None = None,
    engine_args: Sequence[str] = (),
    n_samples: int | None = None,
    min_n_events: int | None = None,
    max_n_events: int | None = None,
    forbidden_events: Collection[str] = (),
    directives: GenDirective | None = None,
) -> list[RelationalData]

Generate synthetic event data.

Parameters:

Name Type Description Default
cfg_gen GenConfig

The generation configuration.

required
ctx Data | None

The context.

None
output_dir Path | str | None

The output directory for generation logs.

None
engine_args Sequence[str]

CLI arguments for the generation engine.

()
n_samples int | None

The number of samples to generate. Only one of n_samples and ctx should be provided.

None
min_n_events int | None

The optional minimum number of generated events per prompt.

None
max_n_events int | None

The optional maximum number of generated events per prompt.

None
forbidden_events Collection[str]

An optional collection of events that should not be generated.

()
directives GenDirective | None

The generation directives for the dataset columns.

None

Returns:

Type Description
list[RelationalData]

A list of RelationalData objects, with a single element containing the generated synthetic data.

RelEventPredict pydantic-config

RelEventPredict(
    description: str = '',
    preproc: RelEventPreproc,
    n_events: int | float | Iterable[int | float] | None = None,
)

Event data prediction task (continuation of event data time series).

Parameters:

Name Type Description Default
preproc RelEventPreproc

A preprocessor.

required
n_events int | float | Iterable[int | float] | None

The number of events to include in the context during training. If an int, it is the number of context events for each example. If a float, it is the fraction of each example's events used as context. If an iterable of int or float, each value is applied to the corresponding example. If None, a random fraction is sampled for each example.

None
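
The four accepted forms of n_events can be resolved into one context size per example roughly as follows. n_context_events is a hypothetical helper. Rounding fractions down, clamping to the number of available events, and using the standard-library random.Random in place of a numpy generator are all assumptions made for this sketch.

```python
from __future__ import annotations

import math
import random
from collections.abc import Iterable


def n_context_events(
    n_events: int | float | Iterable[int | float] | None,
    totals: list[int],
    rng: random.Random,
) -> list[int]:
    """Resolve n_events into a number of context events for each example."""

    def resolve(value: int | float, total: int) -> int:
        # A float is a fraction of the total (rounded down, by assumption);
        # an int is an absolute count. Both are clamped to [0, total].
        count = math.floor(value * total) if isinstance(value, float) else value
        return max(0, min(count, total))

    if n_events is None:
        # A fresh random fraction for every example.
        return [resolve(rng.random(), total) for total in totals]
    if isinstance(n_events, (int, float)):
        return [resolve(n_events, total) for total in totals]
    values = list(n_events)
    if len(values) != len(totals):
        raise ValueError("need one n_events value per example")
    return [resolve(value, total) for value, total in zip(values, totals)]


rng = random.Random(0)
fixed = n_context_events(2, [5, 1, 3], rng)
fraction = n_context_events(0.5, [4, 5], rng)
per_example = n_context_events([1, 0.5], [3, 4], rng)
sampled = n_context_events(None, [10, 20], rng)
```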

get_dataset

get_dataset(data: RelationalData) -> Iterator[DatasetElem]

Get the data to build dataset examples from some input data.

Parameters:

Name Type Description Default
data RelationalData

The input data.

required

Returns:

Type Description
Iterator[DatasetElem]

An Iterator of examples in the form of DatasetElem.

prompt

prompt(
    ctx: Data,
    min_n_events: int | None = None,
    max_n_events: int | None = None,
    forbidden_events: Collection[str] = (),
    directives: GenDirective | None = None,
) -> Iterator[RelGenPrompt]

Get the data to build generation prompts.

Parameters:

Name Type Description Default
ctx Data

The context data.

required
min_n_events int | None

The optional minimum number of generated events per prompt.

None
max_n_events int | None

The optional maximum number of generated events per prompt.

None
forbidden_events Collection[str]

An optional collection of events that should not be generated.

()
directives GenDirective | None

The generation directives for the dataset columns.

None

Returns:

Type Description
Iterator[RelGenPrompt]

An iterable of prompt data in the form of RelGenPrompt.

inverse_transform

inverse_transform(
    ctx: Data,
    x: Iterable[str | dict[str, Any]] | None,
    future: Iterable[Iterable[str | dict[str, Any]]],
    only_future: bool = False,
    progress: int = 0,
) -> list[RelationalData]

Get the predicted futures as a list of RelationalData objects from the output generated by the LLM.

Parameters:

Name Type Description Default
ctx Data

The context data.

required
x Iterable[str | dict[str, Any]] | None

An optional iterable with the context (as available in RelGenPrompt.ctx). If None, the context is obtained from ctx.

required
future Iterable[Iterable[str | dict[str, Any]]]

An iterable with the future generated by the LLM.

required
only_future bool

Whether the final RelationalData objects should only contain the future.

False
progress int

Whether to show the progress bar.

0

Returns:

Type Description
list[RelationalData]

A list of RelationalData objects containing the predicted futures.
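
The only_future flag decides whether each result also carries the context events that preceded the prediction. A minimal sketch of that choice for a single event series, using plain lists of dictionaries instead of RelationalData (assemble_futures and the event fields are hypothetical):

```python
def assemble_futures(
    context_events: list[dict],
    futures: list[list[dict]],
    only_future: bool = False,
) -> list[list[dict]]:
    """Build one event sequence per predicted future."""
    results = []
    for future in futures:
        if only_future:
            # Keep just the predicted continuation.
            results.append(list(future))
        else:
            # Prepend the observed context to the predicted continuation.
            results.append([*context_events, *future])
    return results


context = [{"event": "login", "t": 0}]
futures = [
    [{"event": "purchase", "t": 1}],
    [{"event": "logout", "t": 1}],
]

full_series = assemble_futures(context, futures)
continuations = assemble_futures(context, futures, only_future=True)
```

Either way there is one output sequence per predicted future, which matches the list-valued return type documented above.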

generate

generate(
    cfg_gen: GenConfig,
    ctx: Data | None = None,
    output_dir: Path | str | None = None,
    engine_args: Sequence[str] = (),
    n_future: int = 1,
    min_n_events: int | None = None,
    max_n_events: int | None = None,
    forbidden_events: Collection[str] = (),
    only_future: bool = False,
    directives: GenDirective | None = None,
) -> list[RelationalData]

Generate predicted futures for event data.

Parameters:

Name Type Description Default
cfg_gen GenConfig

The generation configuration.

required
ctx Data | None

The context.

None
output_dir Path | str | None

The output directory for generation logs.

None
engine_args Sequence[str]

CLI arguments for the generation engine.

()
n_future int

The number of predicted futures for each prompt.

1
min_n_events int | None

The optional minimum number of generated events per prompt.

None
max_n_events int | None

The optional maximum number of generated events per prompt.

None
forbidden_events Collection[str]

An optional collection of events that should not be generated.

()
only_future bool

Whether the final RelationalData objects should only contain the future.

False
directives GenDirective | None

The generation directives for the dataset columns.

None

Returns:

Type Description
list[RelationalData]

A list of RelationalData objects, with the n_future generated future predictions.