Tasks
DatasetElem
dataclass
A dataclass representing the data included in a single dataset example.
Attributes:
| Name | Type | Description |
|---|---|---|
| `ctx` | `str` | A JSON string representing the context data. |
| `out` | `str` | A JSON string representing the target. |
| `schema` | `str` | The JSON schema corresponding to the full data. |
| `out_schema` | `str` | The JSON schema of the output data. |
| `description` | `str` | A description. |
| `task` | `str` | The name of the task. |
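As an illustration of how these fields fit together, the sketch below builds one example by hand. `DatasetElemSketch` is a hypothetical stand-in that only mirrors the documented attribute names; it is not the library class, and the record contents and the task name are invented for the example.

```python
import json
from dataclasses import dataclass


@dataclass
class DatasetElemSketch:
    """Hypothetical stand-in mirroring the documented fields of DatasetElem."""

    ctx: str
    out: str
    schema: str
    out_schema: str
    description: str
    task: str


# Illustrative record: "age" is given as context, "income" is the target.
full_schema = {
    "type": "object",
    "properties": {"age": {"type": "integer"}, "income": {"type": "number"}},
}
out_schema = {"type": "object", "properties": {"income": {"type": "number"}}}

elem = DatasetElemSketch(
    ctx=json.dumps({"age": 41}),
    out=json.dumps({"income": 52000.0}),
    schema=json.dumps(full_schema),
    out_schema=json.dumps(out_schema),
    description="Toy customer table",
    task="predict",  # assumed task name, not taken from the library
)

# Every structured field is stored as a JSON string, so it round-trips
# through json.loads.
target = json.loads(elem.out)
```

Keeping the context, target and schemas as JSON strings means each example can be dropped into a text prompt without further serialization.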
RelGenPrompt
dataclass
RelGenPrompt(
schema: str,
ctx: str,
out_schema: str,
description: str,
task: str,
out_model: type[BaseModel],
)
A dataclass with the data used to build a generation prompt.
Attributes:
| Name | Type | Description |
|---|---|---|
| `schema` | `str` | The JSON schema corresponding to the full data. |
| `ctx` | `str` | A JSON string representing the context data. |
| `out_schema` | `str` | The JSON schema of the output data. |
| `out_model` | `type[BaseModel]` | The type of the pydantic model used to validate the generated data. |
| `description` | `str` | A description. |
| `task` | `str` | The name of the task. |
ColumnDirective
pydantic-config
A configuration containing the generation directives for a single column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `impute_na` | `bool` | Whether to impute NA values. | `False` |
| `impute_protected` | `bool` | Whether to impute the mask value used to protect data. | `False` |
| `mask` | `Sequence` | Sequence of values to exclude from generation. | `()` |
TableDirective
module-attribute
TableDirective = dict[str, ColumnDirective]
A dictionary with the generation directive for a table's columns.
GenDirective
module-attribute
GenDirective = dict[str, TableDirective]
A dictionary with the generation directives for the dataset tables.
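The nesting of the two aliases (table name, then column name, then column directive) can be sketched as follows. `ColumnDirectiveSketch` is a hypothetical dataclass standing in for the pydantic `ColumnDirective`; the table names, column names and the `masked_values` helper are invented for the example.

```python
from dataclasses import dataclass
from typing import Any, Sequence


@dataclass
class ColumnDirectiveSketch:
    """Hypothetical stand-in with the documented ColumnDirective fields."""

    impute_na: bool = False
    impute_protected: bool = False
    mask: Sequence[Any] = ()


# GenDirective-like structure: table name -> column name -> directive.
directives = {
    "customers": {
        "country": ColumnDirectiveSketch(mask=("N/A", "unknown")),
        "age": ColumnDirectiveSketch(impute_na=True),
    },
    "orders": {"status": ColumnDirectiveSketch(impute_protected=True)},
}


def masked_values(directives, table, column):
    """Return the values excluded from generation for a column, or () if none."""
    col = directives.get(table, {}).get(column)
    return tuple(col.mask) if col is not None else ()
```

Tables and columns without an entry simply have no directive, which the helper treats as "nothing excluded".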
GenConfig
pydantic-config
GenConfig(
model: str,
prompt_template: str | Callable[[GenPrompt], str],
engine: Engine | str = VLLM,
model_dir: Path | str | None = None,
unsloth: bool = False,
guided: bool = True,
retry_on_fail: NonNegativeInt = 100,
engine_kwargs: dict[str, Any] = Field(
default_factory=dict
),
generate_kwargs: dict[str, Any] = Field(
default_factory=dict
),
)
Generation configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `str` | Model name or path. | required |
| `prompt_template` | `str \| Callable[[GenPrompt], str]` | The template for the prompt. It may contain as keys the fields of | required |
| `engine` | `Engine \| str` | The generation backend engine. | `VLLM` |
| `model_dir` | `Path \| str \| None` | The directory where to save the converted model. | `None` |
| `unsloth` | `bool` | Whether to use | `False` |
| `guided` | `bool` | Whether to use guided generation. | `True` |
| `retry_on_fail` | `NonNegativeInt` | How many times to retry generating valid output before interrupting. | `100` |
| `engine_kwargs` | `dict[str, Any]` | Keyword arguments used when building the generator engine. | `Field(default_factory=dict)` |
| `generate_kwargs` | `dict[str, Any]` | Keyword arguments for the engine sampling parameters. | `Field(default_factory=dict)` |
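The two accepted forms of `prompt_template` (a string or a callable) might look like the sketch below. It assumes the prompt object exposes the fields documented for `RelGenPrompt` and that string templates are filled with `str.format`-style keys; neither is confirmed by this page. `PromptSketch`, `build_prompt` and `TEMPLATE` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class PromptSketch:
    """Hypothetical stand-in for the prompt data passed to the template."""

    schema: str
    ctx: str
    out_schema: str
    description: str
    task: str


def build_prompt(p: PromptSketch) -> str:
    """Callable form: receives the prompt data and returns the prompt text."""
    parts = [
        f"Task: {p.task}",
        f"Dataset: {p.description}",
        f"Output schema: {p.out_schema}",
    ]
    if p.ctx:  # only mention the context when there is one
        parts.append(f"Context: {p.ctx}")
    parts.append("Answer with a single JSON object.")
    return "\n".join(parts)


# String form, assuming str.format-style keys named after the prompt fields.
TEMPLATE = (
    "Task: {task}\nDataset: {description}\n"
    "Output schema: {out_schema}\nContext: {ctx}"
)

example = PromptSketch(
    schema="{}",
    ctx='{"age": 41}',
    out_schema='{"type": "object"}',
    description="Toy customer table",
    task="predict",
)
from_callable = build_prompt(example)
from_template = TEMPLATE.format(**vars(example))
```

The callable form is the more flexible of the two, since it can omit sections such as an empty context.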
Task
An enumeration of the available tasks.
Attributes:
| Name | Description |
|---|---|
| `SYNTH` | Full synthetic data generation. |
| `SEMI_SYNTH` | Generation of synthetic data with a random component used as context. |
| `PREDICT` | Prediction of a subset of the data from the remaining part (context). |
| `EVENT` | Full generation of synthetic event data. |
| `EVENT_PREDICT` | Continuation of event data time series. |
Synthetic data tasks
RelSynth
pydantic-config
RelSynth(description: str = '', preproc: RelSynthPreproc, cols_ctx: list[str] = [])
Synthetic data generation task.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `preproc` | `RelSynthPreproc` | A preprocessor. | required |
| `cols_ctx` | `list[str]` | The columns of the root table to use as context. | `[]` |
get_dataset
get_dataset(data: RelationalData) -> Iterator[DatasetElem]
Get the data to build dataset examples from some input data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `RelationalData` | The input data. | required |

Returns:

| Type | Description |
|---|---|
| `Iterator[DatasetElem]` | An |
prompt
prompt(
ctx: Data | None = None,
n_samples: int | None = None,
directives: GenDirective | None = None,
) -> Iterable[RelGenPrompt]
Get the data to build generation prompts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ctx` | `Data \| None` | The context root columns, if needed. | `None` |
| `n_samples` | `int \| None` | The number of samples to generate. Only one between | `None` |
| `directives` | `GenDirective \| None` | The generation directives for the dataset columns. | `None` |

Returns:

| Type | Description |
|---|---|
| `Iterable[RelGenPrompt]` | An iterable of prompt data in the form of |
inverse_transform
inverse_transform(
x: Iterable[str | dict[str, Any]] | None,
y: Iterable[str | dict[str, Any]],
ctx: Data | None,
progress: int | None = None,
) -> RelationalData
Get a RelationalData object from the output generated by the LLM.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `Iterable[str \| dict[str, Any]] \| None` | An iterable with the context (as available in | required |
| `y` | `Iterable[str \| dict[str, Any]]` | An iterable with the output generated by the LLM. | required |
| `ctx` | `Data \| None` | The context, if any. | required |
| `progress` | `int \| None` | Whether to show the progress bar. | `None` |

Returns:

| Type | Description |
|---|---|
| `RelationalData` | The |
generate
generate(
cfg_gen: GenConfig,
ctx: Data | None = None,
output_dir: Path | str | None = None,
engine_args: Sequence[str] = (),
n_samples: int | None = None,
directives: GenDirective | None = None,
) -> list[RelationalData]
Generate synthetic data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `cfg_gen` | `GenConfig` | The generation configuration. | required |
| `ctx` | `Data \| None` | The context. | `None` |
| `output_dir` | `Path \| str \| None` | The output directory for generation logs. | `None` |
| `engine_args` | `Sequence[str]` | CLI arguments for the generation engine. | `()` |
| `n_samples` | `int \| None` | The number of samples to generate. Only one between | `None` |
| `directives` | `GenDirective \| None` | The generation directives for the dataset columns. | `None` |

Returns:

| Type | Description |
|---|---|
| `list[RelationalData]` | A list of |
RelSemiSynth
pydantic-config
RelSemiSynth(description: str = '', preproc: RelSynthPreproc, p_field: Annotated[float, Field(ge=0, le=1)], p_child: Annotated[float, Field(ge=0, le=1)], ctx_as_const: bool = False, rng_train: NpRng = None)
Semisynthetic data generation task.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `preproc` | `RelSynthPreproc` | A preprocessor. | required |
| `p_field` | `Annotated[float, Field(ge=0, le=1)]` | The probability that a column field is selected in the context. | required |
| `p_child` | `Annotated[float, Field(ge=0, le=1)]` | The probability that a child field is selected in the context. | required |
| `ctx_as_const` | `bool` | Whether to leave the context fields as constants in the output JSON schema. | `False` |
| `rng_train` | `NpRng` | A numpy random number generator used to split the training data. | `None` |
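To make the role of `p_field` concrete, the sketch below splits a flat record into a context part and a remainder, keeping each field as context independently with probability `p_field`. This is only an illustration of the idea, not the library's sampling code: the function name is hypothetical, it ignores `p_child`, and it uses the standard-library `random.Random` instead of a numpy generator to stay dependency-free.

```python
import random


def sample_context_fields(record, p_field, rng):
    """Split a record into (context, remainder).

    Each field goes to the context independently with probability p_field.
    """
    if not 0.0 <= p_field <= 1.0:
        raise ValueError("p_field must be in [0, 1]")
    ctx, rest = {}, {}
    for name, value in record.items():
        # rng.random() is uniform in [0, 1), so p_field=1 keeps every field
        # and p_field=0 keeps none.
        if rng.random() < p_field:
            ctx[name] = value
        else:
            rest[name] = value
    return ctx, rest


record = {"age": 41, "country": "IT", "income": 52000.0}
all_ctx, none_left = sample_context_fields(record, 1.0, random.Random(0))
no_ctx, everything = sample_context_fields(record, 0.0, random.Random(0))
```

Passing a seeded generator makes the split reproducible, which is the usual reason for exposing parameters such as `rng_train` and `rng`.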
get_dataset
get_dataset(data: RelationalData) -> Iterator[DatasetElem]
Get the data to build dataset examples from some input data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `RelationalData` | The input data. | required |

Returns:

| Type | Description |
|---|---|
| `Iterator[DatasetElem]` | An |
prompt
prompt(
ctx: Data, rng: NpRng = None
) -> Iterator[RelGenPrompt]
Get the data to build generation prompts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ctx` | `Data` | The data to sample the context from. | required |
| `rng` | `NpRng` | A numpy random number generator used to sample the context from the provided data. | `None` |

Returns:

| Type | Description |
|---|---|
| `Iterator[RelGenPrompt]` | An iterable of prompt data in the form of |
inverse_transform
inverse_transform(
ctx: Iterable[str | dict[str, Any]],
y: Iterable[str | dict[str, Any]],
progress: int | None = None,
) -> RelationalData
Get a RelationalData object from the output generated by the LLM.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ctx` | `Iterable[str \| dict[str, Any]]` | An iterable with the context (as available in | required |
| `y` | `Iterable[str \| dict[str, Any]]` | An iterable with the output generated by the LLM. | required |
| `progress` | `int \| None` | Whether to show the progress bar. | `None` |

Returns:

| Type | Description |
|---|---|
| `RelationalData` | The |
generate
generate(
cfg_gen: GenConfig,
ctx: Data | None = None,
output_dir: Path | str | None = None,
engine_args: Sequence[str] = (),
rng: NpRng = None,
) -> list[RelationalData]
Generate semisynthetic data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `cfg_gen` | `GenConfig` | The generation configuration. | required |
| `ctx` | `Data \| None` | The context. | `None` |
| `output_dir` | `Path \| str \| None` | The output directory for generation logs. | `None` |
| `engine_args` | `Sequence[str]` | CLI arguments for the generation engine. | `()` |
| `rng` | `NpRng` | A numpy random number generator used to sample the context from the provided data. | `None` |

Returns:

| Type | Description |
|---|---|
| `list[RelationalData]` | A list of |
RelPredict
pydantic-config
Prediction task.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `preproc` | `RelSynthPreproc` | A preprocessor. | required |
| `cols_tgt` | `dict[str, list[str]]` | A dictionary with the target columns to be predicted for each table. A table missing from the dictionary is considered entirely part of the context. | required |
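The `cols_tgt` convention can be illustrated with a small helper that splits plain dictionaries of rows into a context part and a target part. This is a sketch on toy data, not the library's preprocessing: `split_context_target` and the table layout (a dict mapping table names to lists of row dicts) are assumptions made for the example.

```python
def split_context_target(tables, cols_tgt):
    """Split tables into (context, target) following a cols_tgt mapping.

    tables: dict mapping a table name to a list of row dicts.
    cols_tgt: dict mapping a table name to the list of its target columns.
    """
    context, target = {}, {}
    for table, rows in tables.items():
        # A table absent from cols_tgt has no target columns at all.
        tgt_cols = set(cols_tgt.get(table, ()))
        context[table] = [
            {k: v for k, v in row.items() if k not in tgt_cols} for row in rows
        ]
        if tgt_cols:
            target[table] = [
                {k: v for k, v in row.items() if k in tgt_cols} for row in rows
            ]
    return context, target


tables = {
    "customers": [{"id": 1, "age": 41, "churn": False}],
    "orders": [{"id": 10, "customer_id": 1, "amount": 9.5}],
}
context, target = split_context_target(tables, {"customers": ["churn"]})
```

Here `orders` does not appear in `cols_tgt`, so all of its columns end up in the context and it contributes nothing to the target.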
get_dataset
get_dataset(data: RelationalData) -> Iterator[DatasetElem]
Get the data to build dataset examples from some input data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `RelationalData` | The input data. | required |

Returns:

| Type | Description |
|---|---|
| `Iterator[DatasetElem]` | An |
prompt
prompt(
ctx: Data, directives: GenDirective | None = None
) -> Iterable[RelGenPrompt]
Get the data to build generation prompts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ctx` | `Data` | The data with the context columns. | required |
| `directives` | `GenDirective \| None` | The generation directives for the target columns. | `None` |

Returns:

| Type | Description |
|---|---|
| `Iterable[RelGenPrompt]` | An iterable of prompt data in the form of |
inverse_transform
inverse_transform(
ctx: Data,
pred: Iterable[Iterable[str | dict[str, Any]]],
progress: int | None = None,
) -> list[RelationalData]
Get the predictions in form of RelationalData objects from the output generated by the LLM.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ctx` | `Data` | The context data. | required |
| `pred` | `Iterable[Iterable[str \| dict[str, Any]]]` | An iterable with the predictions generated by the LLM. | required |
| `progress` | `int \| None` | Whether to show the progress bar. | `None` |

Returns:

| Type | Description |
|---|---|
| `list[RelationalData]` | A list of |
generate
generate(
cfg_gen: GenConfig,
ctx: Data | None = None,
output_dir: Path | str | None = None,
engine_args: Sequence[str] = (),
n_pred: int = 1,
directives: GenDirective | None = None,
) -> list[RelationalData]
Generate several predictions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `cfg_gen` | `GenConfig` | The generation configuration. | required |
| `ctx` | `Data \| None` | The context. | `None` |
| `output_dir` | `Path \| str \| None` | The output directory for generation logs. | `None` |
| `engine_args` | `Sequence[str]` | CLI arguments for the generation engine. | `()` |
| `n_pred` | `int` | The number of predictions for each prompt. | `1` |
| `directives` | `GenDirective \| None` | The generation directives for the target columns. | `None` |

Returns:

| Type | Description |
|---|---|
| `list[RelationalData]` | A list of |
Event data tasks
RelEvent
pydantic-config
RelEvent(description: str = '', preproc: RelEventPreproc, cols_ctx: list[str] = [])
Synthetic event data generation task.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `preproc` | `RelEventPreproc` | A preprocessor. | required |
| `cols_ctx` | `list[str]` | The columns of the root table to use as context. | `[]` |
get_dataset
get_dataset(data: RelationalData) -> Iterator[DatasetElem]
Get the data to build dataset examples from some input data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `RelationalData` | The input data. | required |

Returns:

| Type | Description |
|---|---|
| `Iterator[DatasetElem]` | An |
prompt
prompt(
ctx: Data | None = None,
n_samples: int | None = None,
min_n_events: int | None = None,
max_n_events: int | None = None,
forbidden_events: Collection[str] = (),
directives: GenDirective | None = None,
) -> Iterator[RelGenPrompt]
Get the data to build generation prompts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ctx` | `Data \| None` | The context root columns, if needed. | `None` |
| `n_samples` | `int \| None` | The number of samples to generate. Only one between | `None` |
| `min_n_events` | `int \| None` | The optional minimum number of generated events per prompt. | `None` |
| `max_n_events` | `int \| None` | The optional maximum number of generated events per prompt. | `None` |
| `forbidden_events` | `Collection[str]` | An optional collection of events that should not be generated. | `()` |
| `directives` | `GenDirective \| None` | The generation directives for the dataset columns. | `None` |

Returns:

| Type | Description |
|---|---|
| `Iterator[RelGenPrompt]` | An iterable of prompt data in the form of |
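The `min_n_events`, `max_n_events` and `forbidden_events` parameters constrain what a generated event sequence may contain. The sketch below checks a sequence against such constraints after the fact. It is not how the library enforces them, and both the function name and the event layout (a dict with an `"event"` key holding the event name) are assumptions made for illustration.

```python
def satisfies_event_constraints(
    events, min_n_events=None, max_n_events=None, forbidden_events=()
):
    """Return True if a list of event dicts respects the given constraints."""
    forbidden = set(forbidden_events)
    # A bound of None means "no bound", matching the optional parameters.
    if min_n_events is not None and len(events) < min_n_events:
        return False
    if max_n_events is not None and len(events) > max_n_events:
        return False
    # The "event" key is a hypothetical field holding the event name.
    return not any(e["event"] in forbidden for e in events)


generated = [
    {"event": "login", "t": 0},
    {"event": "purchase", "t": 5},
]
ok = satisfies_event_constraints(generated, min_n_events=1, max_n_events=3)
too_long = satisfies_event_constraints(generated, max_n_events=1)
has_forbidden = satisfies_event_constraints(
    generated, forbidden_events=["purchase"]
)
```

A check like this is useful mainly for testing; when guided generation is enabled, constraints of this kind are more naturally expressed in the output schema itself.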
inverse_transform
inverse_transform(
ctx: Data | None,
x: Iterable[str | dict[str, Any]] | None,
y: Iterable[str | dict[str, Any]],
progress: int | None = None,
) -> RelationalData
Get a RelationalData object from the output generated by the LLM.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ctx` | `Data \| None` | The context, if any. | required |
| `x` | `Iterable[str \| dict[str, Any]] \| None` | An iterable with the context (as available in | required |
| `y` | `Iterable[str \| dict[str, Any]]` | An iterable with the output generated by the LLM. | required |
| `progress` | `int \| None` | Whether to show the progress bar. | `None` |

Returns:

| Type | Description |
|---|---|
| `RelationalData` | The |
generate
generate(
cfg_gen: GenConfig,
ctx: Data | None = None,
output_dir: Path | str | None = None,
engine_args: Sequence[str] = (),
n_samples: int | None = None,
min_n_events: int | None = None,
max_n_events: int | None = None,
forbidden_events: Collection[str] = (),
directives: GenDirective | None = None,
) -> list[RelationalData]
Generate synthetic event data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `cfg_gen` | `GenConfig` | The generation configuration. | required |
| `ctx` | `Data \| None` | The context. | `None` |
| `output_dir` | `Path \| str \| None` | The output directory for generation logs. | `None` |
| `engine_args` | `Sequence[str]` | CLI arguments for the generation engine. | `()` |
| `n_samples` | `int \| None` | The number of samples to generate. Only one between | `None` |
| `min_n_events` | `int \| None` | The optional minimum number of generated events per prompt. | `None` |
| `max_n_events` | `int \| None` | The optional maximum number of generated events per prompt. | `None` |
| `forbidden_events` | `Collection[str]` | An optional collection of events that should not be generated. | `()` |
| `directives` | `GenDirective \| None` | The generation directives for the dataset columns. | `None` |

Returns:

| Type | Description |
|---|---|
| `list[RelationalData]` | A list of |
RelEventPredict
pydantic-config
RelEventPredict(description: str = '', preproc: RelEventPreproc, n_events: int | float | Iterable[int | float] | None = None)
Event data prediction task (continuation of event data time series).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `preproc` | `RelEventPreproc` | A preprocessor. | required |
| `n_events` | `int \| float \| Iterable[int \| float] \| None` | The number of events to include in the context during training. If an int, it is the number of context events for each example. If a float, it is the fraction of context events over the total for each example. If an iterable of int or float, each value is used for the corresponding example. If None, a random fraction is sampled for each example. | `None` |
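The four accepted forms of `n_events` can be summarized with a small resolver that turns the specification into a number of context events per example. This is a sketch of the documented semantics only: `context_sizes` is a hypothetical helper, and the choices to round fractions down and to clip the result to the number of available events are assumptions, not documented behavior.

```python
import random


def context_sizes(n_events, totals, rng=None):
    """Resolve an n_events specification into one context size per example.

    totals: the total number of events available in each example.
    """
    rng = rng or random.Random()

    def resolve(spec, total):
        if spec is None:
            spec = rng.random()  # random fraction in [0, 1)
        if isinstance(spec, float):
            spec = int(spec * total)  # assumed: fractions are rounded down
        return max(0, min(spec, total))  # assumed: clipped to [0, total]

    # A single int, float or None applies to every example.
    if n_events is None or isinstance(n_events, (int, float)):
        return [resolve(n_events, t) for t in totals]

    # Otherwise expect one value per example.
    specs = list(n_events)
    if len(specs) != len(totals):
        raise ValueError("n_events must provide one value per example")
    return [resolve(s, t) for s, t in zip(specs, totals)]


fixed = context_sizes(3, [10, 2])
fraction = context_sizes(0.5, [10, 4])
per_example = context_sizes([1, 0.5], [10, 4])
sampled = context_sizes(None, [10, 4], rng=random.Random(0))
```

With a fixed count of 3, an example that only has 2 events contributes both of them to the context under the clipping assumption.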
get_dataset
get_dataset(data: RelationalData) -> Iterator[DatasetElem]
Get the data to build dataset examples from some input data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `RelationalData` | The input data. | required |

Returns:

| Type | Description |
|---|---|
| `Iterator[DatasetElem]` | An |
prompt
prompt(
ctx: Data,
min_n_events: int | None = None,
max_n_events: int | None = None,
forbidden_events: Collection[str] = (),
directives: GenDirective | None = None,
) -> Iterator[RelGenPrompt]
Get the data to build generation prompts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ctx` | `Data` | The context data. | required |
| `min_n_events` | `int \| None` | The optional minimum number of generated events per prompt. | `None` |
| `max_n_events` | `int \| None` | The optional maximum number of generated events per prompt. | `None` |
| `forbidden_events` | `Collection[str]` | An optional collection of events that should not be generated. | `()` |
| `directives` | `GenDirective \| None` | The generation directives for the dataset columns. | `None` |

Returns:

| Type | Description |
|---|---|
| `Iterator[RelGenPrompt]` | An iterable of prompt data in the form of |
inverse_transform
inverse_transform(
ctx: Data,
x: Iterable[str | dict[str, Any]] | None,
future: Iterable[Iterable[str | dict[str, Any]]],
only_future: bool = False,
progress: int = 0,
) -> list[RelationalData]
Get the predicted futures as a list of RelationalData objects from the output generated by the LLM.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ctx` | `Data` | The context data. | required |
| `x` | `Iterable[str \| dict[str, Any]] \| None` | An optional iterable with the context (as available in | required |
| `future` | `Iterable[Iterable[str \| dict[str, Any]]]` | An iterable with the future generated by the LLM. | required |
| `only_future` | `bool` | Whether the final | `False` |
| `progress` | `int` | Whether to show the progress bar. | `0` |

Returns:

| Type | Description |
|---|---|
| `list[RelationalData]` | A list of |
generate
generate(
cfg_gen: GenConfig,
ctx: Data | None = None,
output_dir: Path | str | None = None,
engine_args: Sequence[str] = (),
n_future: int = 1,
min_n_events: int | None = None,
max_n_events: int | None = None,
forbidden_events: Collection[str] = (),
only_future: bool = False,
directives: GenDirective | None = None,
) -> list[RelationalData]
Generate one or more predicted futures for the event data in the context.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `cfg_gen` | `GenConfig` | The generation configuration. | required |
| `ctx` | `Data \| None` | The context. | `None` |
| `output_dir` | `Path \| str \| None` | The output directory for generation logs. | `None` |
| `engine_args` | `Sequence[str]` | CLI arguments for the generation engine. | `()` |
| `n_future` | `int` | The number of predicted futures for each prompt. | `1` |
| `min_n_events` | `int \| None` | The optional minimum number of generated events per prompt. | `None` |
| `max_n_events` | `int \| None` | The optional maximum number of generated events per prompt. | `None` |
| `forbidden_events` | `Collection[str]` | An optional collection of events that should not be generated. | `()` |
| `only_future` | `bool` | Whether the final | `False` |
| `directives` | `GenDirective \| None` | The generation directives for the dataset columns. | `None` |

Returns:

| Type | Description |
|---|---|
| `list[RelationalData]` | A list of |