Tasks

DatasetElem `dataclass`

DatasetElem(
    ctx: str,
    out: str,
    schema: str,
    out_schema: str,
    description: str,
    task: str,
)

A dataclass representing the data included in a single dataset example.

Attributes:

Name	Type	Description
`ctx`	`str`	A JSON string representing the context data.
`out`	`str`	A JSON string representing the target.
`schema`	`str`	The JSON schema corresponding to the full data.
`out_schema`	`str`	The JSON schema of the output data.
`description`	`str`	A description.
`task`	`str`	The name of the task.
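
As a rough illustration of how the fields relate, the sketch below uses a stand-in dataclass with the same field names. It is not imported from the library, and the record, schemas, description and task name are all made up for the example.

```python
import json
from dataclasses import dataclass


@dataclass
class DatasetElemSketch:
    """Stand-in mirroring the documented fields; not the library class."""

    ctx: str
    out: str
    schema: str
    out_schema: str
    description: str
    task: str


# Hypothetical schemas: the full record and the part to be generated.
full_schema = {
    "type": "object",
    "properties": {"age": {"type": "integer"}, "city": {"type": "string"}},
}
out_schema = {"type": "object", "properties": {"city": {"type": "string"}}}

elem = DatasetElemSketch(
    ctx=json.dumps({"age": 41}),
    out=json.dumps({"city": "Turin"}),
    schema=json.dumps(full_schema),
    out_schema=json.dumps(out_schema),
    description="Customer records",  # made-up description
    task="predict",  # task name assumed for illustration
)

# Both payloads are JSON strings, so they round-trip through json.loads.
record = {**json.loads(elem.ctx), **json.loads(elem.out)}
```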

RelGenPrompt `dataclass`

RelGenPrompt(
    schema: str,
    ctx: str,
    out_schema: str,
    description: str,
    task: str,
    out_model: type[BaseModel],
)

A dataclass with the data used to build a generation prompt.

Attributes:

Name	Type	Description
`schema`	`str`	The JSON schema corresponding to the full data.
`ctx`	`str`	A JSON string representing the context data.
`out_schema`	`str`	The JSON schema of the output data.
`out_model`	`type[BaseModel]`	The type of the pydantic model used to validate the generated data.
`description`	`str`	A description.
`task`	`str`	The name of the task.

ColumnDirective `pydantic-config`

ColumnDirective(
    impute_na: bool = False,
    impute_protected: bool = False,
    mask: Sequence = (),
)

A configuration containing the generation directives for a single column.

Parameters:

Name	Type	Description	Default
`impute_na`	`bool`	Whether to impute NA values.	`False`
`impute_protected`	`bool`	Whether to impute the mask value used to protect data.	`False`
`mask`	`Sequence`	Sequence of values to exclude from generation.	`()`

TableDirective `module-attribute`

TableDirective = dict[str, ColumnDirective]

A dictionary with the generation directive for a table's columns.

GenDirective `module-attribute`

GenDirective = dict[str, TableDirective]

A dictionary with the generation directives for the dataset tables.
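
The nesting of the two aliases (table name, then column name, then column directive) can be sketched as follows. The dataclass is a stand-in with the documented fields and defaults rather than the library's pydantic class, the table and column names are hypothetical, and the filtering helper only illustrates one plausible reading of `mask`, not how the library applies it.

```python
from dataclasses import dataclass
from typing import Any, Sequence


@dataclass
class ColumnDirectiveSketch:
    """Stand-in with the documented fields and defaults; not the library class."""

    impute_na: bool = False
    impute_protected: bool = False
    mask: Sequence[Any] = ()


# GenDirective shape: table name -> column name -> column directive.
directives = {
    "customers": {
        "income": ColumnDirectiveSketch(impute_na=True),
        "city": ColumnDirectiveSketch(mask=("Unknown", "N/A")),
    },
}


def allowed_values(candidates, directive):
    """Drop the masked values from a list of candidate values."""
    return [v for v in candidates if v not in directive.mask]


cities = allowed_values(["Turin", "Unknown", "Milan"], directives["customers"]["city"])
```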

GenConfig `pydantic-config`

GenConfig(
    model: str,
    prompt_template: str | Callable[[GenPrompt], str],
    engine: Engine | str = VLLM,
    model_dir: Path | str | None = None,
    unsloth: bool = False,
    guided: bool = True,
    retry_on_fail: NonNegativeInt = 100,
    engine_kwargs: dict[str, Any] = Field(
        default_factory=dict
    ),
    generate_kwargs: dict[str, Any] = Field(
        default_factory=dict
    ),
)

Generation configuration.

Parameters:

Name	Type	Description	Default
`model`	`str`	Model name or path.	required
`prompt_template`	`str \| Callable[[GenPrompt], str]`	The template for the prompt. It may contain as keys the fields of `GenPrompt`.	required
`engine`	`Engine \| str`	The generation backend engine.	`VLLM`
`model_dir`	`Path \| str \| None`	The directory where the converted model is saved.	`None`
`unsloth`	`bool`	Whether to use `unsloth` for merging the model.	`False`
`guided`	`bool`	Whether to use guided generation.	`True`
`retry_on_fail`	`NonNegativeInt`	How many times to retry generating valid output before giving up.	`100`
`engine_kwargs`	`dict[str, Any]`	Keyword arguments used when building the generator engine.	`Field(default_factory=dict)`
`generate_kwargs`	`dict[str, Any]`	Keyword arguments for the engine sampling parameters.	`Field(default_factory=dict)`
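
The two accepted forms of `prompt_template` might be used as in the sketch below. The prompt dataclass is a stand-in that carries only string fields, and the assumption that the string form is filled in with `str.format` is ours; the reference only says that the template may contain the prompt fields as keys.

```python
from dataclasses import asdict, dataclass


@dataclass
class PromptDataSketch:
    """Stand-in for the prompt data; only string fields are included."""

    schema: str
    ctx: str
    out_schema: str
    description: str
    task: str


# String form: field names appear as keys (str.format syntax is an assumption).
template = "Task: {task}\n{description}\nContext: {ctx}\nOutput schema: {out_schema}"


def template_fn(data):
    """Callable form producing the same text as the string template."""
    return (
        f"Task: {data.task}\n{data.description}\n"
        f"Context: {data.ctx}\nOutput schema: {data.out_schema}"
    )


def render(prompt_template, data):
    """Accept either a template string or a callable, as the parameter type allows."""
    if callable(prompt_template):
        return prompt_template(data)
    return prompt_template.format(**asdict(data))


data = PromptDataSketch(
    schema="{}",
    ctx='{"age": 41}',
    out_schema="{}",
    description="Customer records",
    task="synth",
)
text = render(template, data)
```

The callable form is the more flexible of the two, since it can branch on the task or truncate a long context before building the text.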

Task

An enumeration of the available tasks.

Attributes:

Name	Type	Description
`SYNTH`		Full synthetic data generation.
`SEMI_SYNTH`		Generation of synthetic data with a random component used as context.
`PREDICT`		Prediction of a subset of the data from the remaining part (context).
`EVENT`		Full generation of synthetic event data.
`EVENT_PREDICT`		Continuation of event data time series.

Synthetic data tasks

RelSynth `pydantic-config`

RelSynth(
    description: str = '',
    preproc: RelSynthPreproc,
    cols_ctx: list[str] = [],
)

Synthetic data generation task.

Parameters:

Name	Type	Description	Default
`preproc`	`RelSynthPreproc`	A preprocessor.	required
`cols_ctx`	`list[str]`	The columns of the root table to use as context.	`[]`

get_dataset

get_dataset(data: RelationalData) -> Iterator[DatasetElem]

Get the data to build dataset examples from some input data.

Parameters:

Name	Type	Description	Default
`data`	`RelationalData`	The input data.	required

Returns:

Type	Description
`Iterator[DatasetElem]`	An `Iterator` of examples in the form of `DatasetElem`.

prompt

prompt(
    ctx: Data | None = None,
    n_samples: int | None = None,
    directives: GenDirective | None = None,
) -> Iterable[RelGenPrompt]

Get the data to build generation prompts.

Parameters:

Name	Type	Description	Default
`ctx`	`Data \| None`	The context root columns, if needed.	`None`
`n_samples`	`int \| None`	The number of samples to generate. Only one of `n_samples` and `ctx` should be provided.	`None`
`directives`	`GenDirective \| None`	The generation directives for the dataset columns.	`None`

Returns:

Type	Description
`Iterable[RelGenPrompt]`	An iterable of prompt data in the form of `RelGenPrompt`.
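
The rule about `n_samples` and `ctx` can be checked up front, as in the sketch below. It reads the rule as "exactly one of the two" and raises otherwise; both the helper name and that behaviour are assumptions, since the reference does not say how the library reacts when the rule is violated.

```python
def check_prompt_args(ctx=None, n_samples=None):
    """Return which argument drives generation; raise unless exactly one is set."""
    if (ctx is None) == (n_samples is None):
        raise ValueError("Provide exactly one of `ctx` and `n_samples`.")
    return "ctx" if ctx is not None else "n_samples"


mode = check_prompt_args(n_samples=10)
```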

inverse_transform

inverse_transform(
    x: Iterable[str | dict[str, Any]] | None,
    y: Iterable[str | dict[str, Any]],
    ctx: Data | None,
    progress: int | None = None,
) -> RelationalData

Get a RelationalData object from the output generated by the LLM.

Parameters:

Name	Type	Description	Default
`x`	`Iterable[str \| dict[str, Any]] \| None`	An iterable with the context (as available in `RelGenPrompt.ctx`), if any.	required
`y`	`Iterable[str \| dict[str, Any]]`	An iterable with the output generated by the LLM.	required
`ctx`	`Data \| None`	The context, if any.	required
`progress`	`int \| None`	Whether to show the progress bar.	`None`

Returns:

Type	Description
`RelationalData`	The `RelationalData`.
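
Since `x` and `y` may mix JSON strings and already-parsed dictionaries, a caller may want to normalize them before inspecting the generated output. The helper below is a hypothetical sketch of that step and is not part of the library.

```python
import json


def as_dicts(items):
    """Yield each item as a dict, parsing JSON strings along the way."""
    for item in items:
        yield json.loads(item) if isinstance(item, str) else item


# Made-up model output: one raw JSON string and one parsed dictionary.
y = ['{"city": "Turin"}', {"city": "Milan"}]
rows = list(as_dicts(y))
```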

generate

generate(
    cfg_gen: GenConfig,
    ctx: Data | None = None,
    output_dir: Path | str | None = None,
    engine_args: Sequence[str] = (),
    n_samples: int | None = None,
    directives: GenDirective | None = None,
) -> list[RelationalData]

Generate synthetic data.

Parameters:

Name	Type	Description	Default
`cfg_gen`	`GenConfig`	The generation configuration.	required
`ctx`	`Data \| None`	The context.	`None`
`output_dir`	`Path \| str \| None`	The output directory for generation logs.	`None`
`engine_args`	`Sequence[str]`	CLI arguments for the generation engine.	`()`
`n_samples`	`int \| None`	The number of samples to generate. Only one of `n_samples` and `ctx` should be provided.	`None`
`directives`	`GenDirective \| None`	The generation directives for the dataset columns.	`None`

Returns:

Type	Description
`list[RelationalData]`	A list of `RelationalData` objects, with a single element containing the generated synthetic data.

RelSemiSynth `pydantic-config`

RelSemiSynth(
    description: str = '',
    preproc: RelSynthPreproc,
    p_field: Annotated[float, Field(ge=0, le=1)],
    p_child: Annotated[float, Field(ge=0, le=1)],
    ctx_as_const: bool = False,
    rng_train: NpRng = None,
)

Semisynthetic data generation task.

Parameters:

Name	Type	Description	Default
`preproc`	`RelSynthPreproc`	A preprocessor.	required
`p_field`	`Annotated[float, Field(ge=0, le=1)]`	The probability that a column field is selected in the context.	required
`p_child`	`Annotated[float, Field(ge=0, le=1)]`	The probability that a child field is selected in the context.	required
`ctx_as_const`	`bool`	Whether to leave the context fields as constants in the output JSON schema.	`False`
`rng_train`	`NpRng`	A numpy random number generator used to split the training data.	`None`
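
One way to picture `p_field` is as an independent coin flip per field, as in the sketch below; `p_child` would play the same role for child rows. The helper is hypothetical and uses the standard library `random.Random` instead of a numpy generator to stay self-contained, so it does not reproduce the library's actual sampling.

```python
import random


def sample_context_fields(record, p_field, rng):
    """Split a flat record into (context, remainder) by independent coin flips."""
    ctx, rest = {}, {}
    for name, value in record.items():
        # Each field joins the context with probability p_field.
        (ctx if rng.random() < p_field else rest)[name] = value
    return ctx, rest


record = {"age": 41, "city": "Turin", "income": 52000}
ctx, rest = sample_context_fields(record, p_field=0.5, rng=random.Random(0))
```

With `p_field=0` the context is always empty, and with `p_field=1` every field ends up in the context.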

get_dataset

get_dataset(data: RelationalData) -> Iterator[DatasetElem]

Get the data to build dataset examples from some input data.

Parameters:

Name	Type	Description	Default
`data`	`RelationalData`	The input data.	required

Returns:

Type	Description
`Iterator[DatasetElem]`	An `Iterator` of examples in the form of `DatasetElem`.

prompt

prompt(
    ctx: Data, rng: NpRng = None
) -> Iterator[RelGenPrompt]

Get the data to build generation prompts.

Parameters:

Name	Type	Description	Default
`ctx`	`Data`	The data to sample the context from.	required
`rng`	`NpRng`	A numpy random number generator used to sample the context from the provided data.	`None`

Returns:

Type	Description
`Iterator[RelGenPrompt]`	An iterable of prompt data in the form of `RelGenPrompt`.

inverse_transform

inverse_transform(
    ctx: Iterable[str | dict[str, Any]],
    y: Iterable[str | dict[str, Any]],
    progress: int | None = None,
) -> RelationalData

Get a RelationalData object from the output generated by the LLM.

Parameters:

Name	Type	Description	Default
`ctx`	`Iterable[str \| dict[str, Any]]`	An iterable with the context (as available in `RelGenPrompt.ctx`).	required
`y`	`Iterable[str \| dict[str, Any]]`	An iterable with the output generated by the LLM.	required
`progress`	`int \| None`	Whether to show the progress bar.	`None`

Returns:

Type	Description
`RelationalData`	The `RelationalData`.

generate

generate(
    cfg_gen: GenConfig,
    ctx: Data | None = None,
    output_dir: Path | str | None = None,
    engine_args: Sequence[str] = (),
    rng: NpRng = None,
) -> list[RelationalData]

Generate semisynthetic data.

Parameters:

Name	Type	Description	Default
`cfg_gen`	`GenConfig`	The generation configuration.	required
`ctx`	`Data \| None`	The context.	`None`
`output_dir`	`Path \| str \| None`	The output directory for generation logs.	`None`
`engine_args`	`Sequence[str]`	CLI arguments for the generation engine.	`()`
`rng`	`NpRng`	A numpy random number generator used to sample the context from the provided data.	`None`

Returns:

Type	Description
`list[RelationalData]`	A list of `RelationalData` objects, with a single element containing the generated semi-synthetic data.

RelPredict `pydantic-config`

RelPredict(
    description: str = '',
    preproc: RelSynthPreproc,
    cols_tgt: dict[str, list[str]],
)

Prediction task.

Parameters:

Name	Type	Description	Default
`preproc`	`RelSynthPreproc`	A preprocessor.	required
`cols_tgt`	`dict[str, list[str]]`	A dictionary with the target columns to be predicted for each table. A table missing from the dictionary is treated as entirely part of the context.	required
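
The effect of `cols_tgt` on the context/target split can be sketched with plain dictionaries. The tables, column names and helper below are hypothetical, and rows are modelled as lists of dicts rather than as `RelationalData`.

```python
def split_context_target(tables, cols_tgt):
    """Split each table's rows into context columns and target columns."""
    ctx, tgt = {}, {}
    for table, rows in tables.items():
        targets = set(cols_tgt.get(table, []))  # missing table -> no targets
        ctx[table] = [
            {k: v for k, v in row.items() if k not in targets} for row in rows
        ]
        if targets:
            tgt[table] = [
                {k: v for k, v in row.items() if k in targets} for row in rows
            ]
    return ctx, tgt


tables = {
    "customers": [{"id": 1, "age": 41, "churn": True}],
    "orders": [{"id": 10, "customer_id": 1, "amount": 25.0}],
}
# Only `customers.churn` is predicted; `orders` is not listed, so it is all context.
ctx, tgt = split_context_target(tables, {"customers": ["churn"]})
```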

get_dataset

get_dataset(data: RelationalData) -> Iterator[DatasetElem]

Get the data to build dataset examples from some input data.

Parameters:

Name	Type	Description	Default
`data`	`RelationalData`	The input data.	required

Returns:

Type	Description
`Iterator[DatasetElem]`	An `Iterator` of examples in the form of `DatasetElem`.

prompt

prompt(
    ctx: Data, directives: GenDirective | None = None
) -> Iterable[RelGenPrompt]

Get the data to build generation prompts.

Parameters:

Name	Type	Description	Default
`ctx`	`Data`	The data with the context columns.	required
`directives`	`GenDirective \| None`	The generation directives for the target columns.	`None`

Returns:

Type	Description
`Iterable[RelGenPrompt]`	An iterable of prompt data in the form of `RelGenPrompt`.

inverse_transform

inverse_transform(
    ctx: Data,
    pred: Iterable[Iterable[str | dict[str, Any]]],
    progress: int | None = None,
) -> list[RelationalData]

Get the predictions in the form of RelationalData objects from the output generated by the LLM.

Parameters:

Name	Type	Description	Default
`ctx`	`Data`	The context data.	required
`pred`	`Iterable[Iterable[str \| dict[str, Any]]]`	An iterable with the predictions generated by the LLM.	required
`progress`	`int \| None`	Whether to show the progress bar.	`None`

Returns:

Type	Description
`list[RelationalData]`	A list of `RelationalData` objects containing the predicted target columns.

generate

generate(
    cfg_gen: GenConfig,
    ctx: Data | None = None,
    output_dir: Path | str | None = None,
    engine_args: Sequence[str] = (),
    n_pred: int = 1,
    directives: GenDirective | None = None,
) -> list[RelationalData]

Generate several predictions.

Parameters:

Name	Type	Description	Default
`cfg_gen`	`GenConfig`	The generation configuration.	required
`ctx`	`Data \| None`	The context.	`None`
`output_dir`	`Path \| str \| None`	The output directory for generation logs.	`None`
`engine_args`	`Sequence[str]`	CLI arguments for the generation engine.	`()`
`n_pred`	`int`	The number of predictions for each prompt.	`1`
`directives`	`GenDirective \| None`	The generation directives for the target columns.	`None`

Returns:

Type	Description
`list[RelationalData]`	A list of `RelationalData` objects, with the `n_pred` generated predictions.

Event data tasks

RelEvent `pydantic-config`

RelEvent(
    description: str = '',
    preproc: RelEventPreproc,
    cols_ctx: list[str] = [],
)

Synthetic event data generation task.

Parameters:

Name	Type	Description	Default
`preproc`	`RelEventPreproc`	A preprocessor.	required
`cols_ctx`	`list[str]`	The columns of the root table to use as context.	`[]`

get_dataset

get_dataset(data: RelationalData) -> Iterator[DatasetElem]

Get the data to build dataset examples from some input data.

Parameters:

Name	Type	Description	Default
`data`	`RelationalData`	The input data.	required

Returns:

Type	Description
`Iterator[DatasetElem]`	An `Iterator` of examples in the form of `DatasetElem`.

prompt

prompt(
    ctx: Data | None = None,
    n_samples: int | None = None,
    min_n_events: int | None = None,
    max_n_events: int | None = None,
    forbidden_events: Collection[str] = (),
    directives: GenDirective | None = None,
) -> Iterator[RelGenPrompt]

Get the data to build generation prompts.

Parameters:

Name	Type	Description	Default
`ctx`	`Data \| None`	The context root columns, if needed.	`None`
`n_samples`	`int \| None`	The number of samples to generate. Only one of `n_samples` and `ctx` should be provided.	`None`
`min_n_events`	`int \| None`	The optional minimum number of generated events per prompt.	`None`
`max_n_events`	`int \| None`	The optional maximum number of generated events per prompt.	`None`
`forbidden_events`	`Collection[str]`	An optional collection of events that should not be generated.	`()`
`directives`	`GenDirective \| None`	The generation directives for the dataset columns.	`None`

Returns:

Type	Description
`Iterator[RelGenPrompt]`	An iterable of prompt data in the form of `RelGenPrompt`.
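
The three event constraints can be read as a simple acceptance test on a generated list of events, as sketched below. The helper is hypothetical; treating the bounds as inclusive and `forbidden_events` as a collection of event names are assumptions, since the reference does not spell them out.

```python
def check_events(events, min_n_events=None, max_n_events=None, forbidden_events=()):
    """Return True if a generated list of event names satisfies the constraints."""
    if min_n_events is not None and len(events) < min_n_events:
        return False
    if max_n_events is not None and len(events) > max_n_events:
        return False
    return not any(e in forbidden_events for e in events)


# Made-up event names for illustration.
ok = check_events(
    ["login", "purchase"],
    min_n_events=1,
    max_n_events=3,
    forbidden_events={"refund"},
)
```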

inverse_transform

inverse_transform(
    ctx: Data | None,
    x: Iterable[str | dict[str, Any]] | None,
    y: Iterable[str | dict[str, Any]],
    progress: int | None = None,
) -> RelationalData

Get a RelationalData object from the output generated by the LLM.

Parameters:

Name	Type	Description	Default
`ctx`	`Data \| None`	The context, if any.	required
`x`	`Iterable[str \| dict[str, Any]] \| None`	An iterable with the context (as available in `RelGenPrompt.ctx`), if any.	required
`y`	`Iterable[str \| dict[str, Any]]`	An iterable with the output generated by the LLM.	required
`progress`	`int \| None`	Whether to show the progress bar.	`None`

Returns:

Type	Description
`RelationalData`	The `RelationalData`.

generate

generate(
    cfg_gen: GenConfig,
    ctx: Data | None = None,
    output_dir: Path | str | None = None,
    engine_args: Sequence[str] = (),
    n_samples: int | None = None,
    min_n_events: int | None = None,
    max_n_events: int | None = None,
    forbidden_events: Collection[str] = (),
    directives: GenDirective | None = None,
) -> list[RelationalData]

Generate synthetic event data.

Parameters:

Name	Type	Description	Default
`cfg_gen`	`GenConfig`	The generation configuration.	required
`ctx`	`Data \| None`	The context.	`None`
`output_dir`	`Path \| str \| None`	The output directory for generation logs.	`None`
`engine_args`	`Sequence[str]`	CLI arguments for the generation engine.	`()`
`n_samples`	`int \| None`	The number of samples to generate. Only one of `n_samples` and `ctx` should be provided.	`None`
`min_n_events`	`int \| None`	The optional minimum number of generated events per prompt.	`None`
`max_n_events`	`int \| None`	The optional maximum number of generated events per prompt.	`None`
`forbidden_events`	`Collection[str]`	An optional collection of events that should not be generated.	`()`
`directives`	`GenDirective \| None`	The generation directives for the dataset columns.	`None`

Returns:

Type	Description
`list[RelationalData]`	A list of `RelationalData` objects, with a single element containing the generated synthetic data.

RelEventPredict `pydantic-config`

RelEventPredict(
    description: str = '',
    preproc: RelEventPreproc,
    n_events: int | float | Iterable[int | float] | None = None,
)

Event data prediction task (continuation of event data time series).

Parameters:

Name	Type	Description	Default
`preproc`	`RelEventPreproc`	A preprocessor.	required
`n_events`	`int \| float \| Iterable[int \| float] \| None`	The number of events to include in the context during training. If an int, it is the number of context events for each example. If a float, it is the fraction of each example's events used as context. If an iterable of int or float, each value is used for the corresponding example. If None, a random fraction is sampled for each example.	`None`
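
The four accepted forms of `n_events` could resolve to per-example context sizes roughly as follows. This is a sketch under stated assumptions, not the library's code: the helper name is hypothetical, fractions are rounded down, counts are capped at the number of available events, and the standard library `random` module stands in for the numpy generator.

```python
import random
from collections.abc import Iterable


def resolve_n_context(n_events, totals, rng=None):
    """Return the number of context events for each example.

    totals[i] is the total number of events in example i.
    """
    if rng is None:
        rng = random.Random()
    if n_events is None:
        # A random fraction is drawn for each example.
        specs = [rng.random() for _ in totals]
    elif isinstance(n_events, Iterable):
        specs = list(n_events)
    else:
        specs = [n_events] * len(totals)
    if len(specs) != len(totals):
        raise ValueError("one value per example is required")
    counts = []
    for spec, total in zip(specs, totals):
        # Floats are fractions of the total (rounded down, an assumption);
        # ints are absolute counts. Either way, cap at the total.
        n = int(spec * total) if isinstance(spec, float) else spec
        counts.append(min(n, total))
    return counts


fixed = resolve_n_context(2, [5, 3])
fractional = resolve_n_context(0.5, [4, 10])
mixed = resolve_n_context([1, 0.5], [4, 4])
```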

get_dataset

get_dataset(data: RelationalData) -> Iterator[DatasetElem]

Get the data to build dataset examples from some input data.

Parameters:

Name	Type	Description	Default
`data`	`RelationalData`	The input data.	required

Returns:

Type	Description
`Iterator[DatasetElem]`	An `Iterator` of examples in the form of `DatasetElem`.

prompt

prompt(
    ctx: Data,
    min_n_events: int | None = None,
    max_n_events: int | None = None,
    forbidden_events: Collection[str] = (),
    directives: GenDirective | None = None,
) -> Iterator[RelGenPrompt]

Get the data to build generation prompts.

Parameters:

Name	Type	Description	Default
`ctx`	`Data`	The context data.	required
`min_n_events`	`int \| None`	The optional minimum number of generated events per prompt.	`None`
`max_n_events`	`int \| None`	The optional maximum number of generated events per prompt.	`None`
`forbidden_events`	`Collection[str]`	An optional collection of events that should not be generated.	`()`
`directives`	`GenDirective \| None`	The generation directives for the dataset columns.	`None`

Returns:

Type	Description
`Iterator[RelGenPrompt]`	An iterable of prompt data in the form of `RelGenPrompt`.

inverse_transform

inverse_transform(
    ctx: Data,
    x: Iterable[str | dict[str, Any]] | None,
    future: Iterable[Iterable[str | dict[str, Any]]],
    only_future: bool = False,
    progress: int = 0,
) -> list[RelationalData]

Get the predicted futures as a list of RelationalData objects from the output generated by the LLM.

Parameters:

Name	Type	Description	Default
`ctx`	`Data`	The context data.	required
`x`	`Iterable[str \| dict[str, Any]] \| None`	An optional iterable with the context (as available in `RelGenPrompt.ctx`). If None, the context is obtained from `ctx`.	required
`future`	`Iterable[Iterable[str \| dict[str, Any]]]`	An iterable with the future generated by the LLM.	required
`only_future`	`bool`	Whether the final `RelationalData` objects should only contain the future.	`False`
`progress`	`int`	Whether to show the progress bar.	`0`

Returns:

Type	Description
`list[RelationalData]`	A list of `RelationalData` objects containing the predicted futures.
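
The role of `only_future` can be pictured with plain lists of events: one series is built per predicted future, with or without the context events in front. The helper and event names below are hypothetical, and lists of dicts stand in for `RelationalData` objects.

```python
def assemble_series(context_events, futures, only_future=False):
    """Build one event series per predicted future."""
    series = []
    for future in futures:
        # With only_future, the context events are left out of the result.
        prefix = [] if only_future else list(context_events)
        series.append(prefix + list(future))
    return series


context = [{"event": "login"}]
futures = [[{"event": "purchase"}], [{"event": "logout"}]]
full = assemble_series(context, futures)
tail = assemble_series(context, futures, only_future=True)
```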

generate

generate(
    cfg_gen: GenConfig,
    ctx: Data | None = None,
    output_dir: Path | str | None = None,
    engine_args: Sequence[str] = (),
    n_future: int = 1,
    min_n_events: int | None = None,
    max_n_events: int | None = None,
    forbidden_events: Collection[str] = (),
    only_future: bool = False,
    directives: GenDirective | None = None,
) -> list[RelationalData]

Generate predicted continuations (futures) of event data.

Parameters:

Name	Type	Description	Default
`cfg_gen`	`GenConfig`	The generation configuration.	required
`ctx`	`Data \| None`	The context.	`None`
`output_dir`	`Path \| str \| None`	The output directory for generation logs.	`None`
`engine_args`	`Sequence[str]`	CLI arguments for the generation engine.	`()`
`n_future`	`int`	The number of predicted futures for each prompt.	`1`
`min_n_events`	`int \| None`	The optional minimum number of generated events per prompt.	`None`
`max_n_events`	`int \| None`	The optional maximum number of generated events per prompt.	`None`
`forbidden_events`	`Collection[str]`	An optional collection of events that should not be generated.	`()`
`only_future`	`bool`	Whether the final `RelationalData` objects should only contain the future.	`False`
`directives`	`GenDirective \| None`	The generation directives for the dataset columns.	`None`

Returns:

Type	Description
`list[RelationalData]`	A list of `RelationalData` objects, with the `n_future` generated future predictions.
