Models

TabularModel

build classmethod

build(
    preproc: TabularPreproc,
    size: str | Size | TabularModelSize,
    block: str | None = None,
    dropout: float | None = 0.12,
) -> TabularModel

Tabular model to generate synthetic tabular relational data.

Parameters:

Name Type Description Default
preproc TabularPreproc

A TabularPreproc object.

required
size str | Size | TabularModelSize

The size configuration of the model. Either a TabularModelSize object, a Size object, or the string representation of a Size.

required
block str | None

The block type. The possible values depend on whether the data is single table or multi table. For a single table, either 'free' (default), 'causal', or 'lstm'. For multi table data, either 'free' (default) or 'lstm'.

None
dropout float | None

The dropout probability.

0.12

Returns:

Type Description
TabularModel

A TabularModel instance.
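
Example (a minimal sketch; imports are omitted because the package path is not given in this reference, and the preproc variable, lowercase size string, and block value are illustrative assumptions):

# preproc is an already-built TabularPreproc for the relational data.
model = TabularModel.build(
    preproc=preproc,
    size="small",   # or Size.SMALL, or an explicit TabularModelSize
    block="free",   # default block type for both single- and multi-table data
    dropout=0.12,
)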

generate

generate(
    n_samples: int | None = None,
    ctx: dict[str, DataFrame] | None = None,
    batch_size: int = 0,
    max_block_size: int = 0,
    temp: float = 1.0,
) -> RelationalData

Generate synthetic relational data.

Parameters:

Name Type Description Default
n_samples int | None

Desired number of samples in the root table. Must be given if and only if ctx is not provided.

None
ctx dict[str, DataFrame] | None

The context from which to start a conditional generation. If provided, n_samples should not be given. The content of the pd.DataFrames must match the context columns provided to the TabularPreproc: 1. If only a subset of the root table's columns is provided, the model generates the foreign keys. 2. If a subset of columns is provided for each table, each pd.DataFrame must also contain the primary and foreign keys, and the generated synthetic data will keep the keys provided in the context. Foreign keys referring to lookup tables should be treated as feature columns, not as foreign keys.

None
batch_size int

Batch size used during generation. If 0, all data is generated in a single batch.

0
max_block_size int

Maximum length for each generated sample. Active only for multi-table datasets and for generation from the root table (denoted above as 1.). If 0, no limit is enforced.

0
temp float

Temperature parameter for sampling.

1.0

Returns:

Type Description
RelationalData

A RelationalData object with the generated synthetic tabular data.
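
Example (a sketch; the table name "customers" and the ctx_df DataFrame are illustrative assumptions):

# Unconditional generation: request a number of rows for the root table.
synth = model.generate(n_samples=1000, batch_size=256)

# Conditional generation: start from a context and omit n_samples.
# ctx_df holds a subset of the root table's columns.
synth = model.generate(ctx={"customers": ctx_df}, batch_size=256, temp=1.0)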

predict_proba

predict_proba(
    ctx: dict[str, DataFrame], batch_size: int = 0
) -> PredProb

Predict probabilities for each category of a single categorical column. In order to use this function, the context must contain all columns except for a single categorical column in the root table.

Parameters:

Name Type Description Default
ctx dict[str, DataFrame]

The input context. Must contain all columns except for a single categorical column in the root table.

required
batch_size int

Batch size used during prediction. If 0, all predictions are performed in a single batch.

0

Returns:

Type Description
PredProb

A PredProb object, containing the predicted probabilities and the corresponding categories.
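
Example (a sketch; ctx is assumed to contain all root-table columns except one categorical column):

pred = model.predict_proba(ctx=ctx, batch_size=256)
probabilities = pred.prob     # torch.Tensor of shape (n_samples, n_categories)
categories = pred.categories  # list[str] aligned with the second dimension of prob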

predict_sample

predict_sample(
    ctx: dict[str, DataFrame],
    n: int,
    batch_size: int = 0,
    temp: float = 1.0,
    rng: Generator | int | None = None,
) -> PredSample

Make n prediction samples from a given context.

Parameters:

Name Type Description Default
ctx dict[str, DataFrame]

The context from which to compute the n prediction samples.

required
n int

The number of prediction samples.

required
batch_size int

Batch size used during prediction. If 0, all predictions are performed in a single batch.

0
temp float

Temperature parameter for sampling.

1.0
rng Generator | int | None

A torch.Generator or an integer seed to control the randomness in the sample. If None, a random seed is generated.

None

Returns:

Type Description
PredSample

A PredSample object, namely a list of the predicted RelationalData (containing only the predicted columns). The common context can be retrieved from the PredSample.ctx attribute.
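
Example (a sketch; ctx is an illustrative context dictionary, and list-style indexing of the result is assumed since PredSample is documented as a list of RelationalData):

# Draw 10 prediction samples from the same context, with a fixed seed.
samples = model.predict_sample(ctx=ctx, n=10, temp=1.0, rng=42)
first = samples[0]        # a RelationalData with only the predicted columns
shared_ctx = samples.ctx  # the common context, as RelationalData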

save

save(path: Path | str) -> None

Save the TabularModel to a checkpoint at the given path.

Parameters:

Name Type Description Default
path Path | str

The path where to save the checkpoint.

required

load classmethod

load(path: Path | str) -> TabularModel

Load the TabularModel from the checkpoint at the given path.

Parameters:

Name Type Description Default
path Path | str

The path to the checkpoint to load.

required

Returns:

Type Description
TabularModel

The loaded TabularModel.
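
Example (the checkpoint path is illustrative):

model.save("checkpoints/tabular_model.pt")
model = TabularModel.load("checkpoints/tabular_model.pt")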

TextModel

build classmethod

build(
    preproc: TextPreproc,
    size: str | Size | TextModelSize,
    block_size: int,
    dropout: float | None = 0.12,
) -> TextModel

Text model to generate synthetic text columns of a table that is part of a relational structure.

Parameters:

Name Type Description Default
preproc TextPreproc

A TextPreproc object.

required
size str | Size | TextModelSize

The size configuration of the model. Either a TextModelSize object, a Size object, or the string representation of a Size.

required
block_size int

Maximum text sequence length that the model can process.

required
dropout float | None

The dropout probability.

0.12

Returns:

Type Description
TextModel

A TextModel instance.
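
Example (a sketch; text_preproc, the size string, and the block_size value are assumptions):

text_model = TextModel.build(
    preproc=text_preproc,  # an already-built TextPreproc
    size="small",
    block_size=512,        # maximum text sequence length
    dropout=0.12,
)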

build_from_pretrained classmethod

build_from_pretrained(
    preproc: TextPreproc,
    path: Path | str,
    block_size: int | None = None,
) -> TextModel

Build a text model from a pretrained model.

Parameters:

Name Type Description Default
preproc TextPreproc

A TextPreproc object.

required
path Path | str

The path to the checkpoint of the pre-trained model.

required
block_size int | None

Maximum text sequence length that the model can process during fine-tuning.

None

Returns:

Type Description
TextModel

A TextModel instance with the weights loaded from the pre-trained model.
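
Example (a sketch; the checkpoint path is illustrative):

text_model = TextModel.build_from_pretrained(
    preproc=text_preproc,
    path="checkpoints/pretrained_text.pt",
    block_size=512,
)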

generate

generate(
    data: RelationalData,
    batch_size: int = 0,
    max_text_len: int = 0,
    temp: float = 1.0,
) -> RelationalData

Generate text columns in the current table.

Parameters:

Name Type Description Default
data RelationalData

A RelationalData object containing synthetic data.

required
batch_size int

Batch size used during generation. If 0, generate all data in a single batch.

0
max_text_len int

Maximum length for the generated text. If 0, the maximum possible value is used, namely the value of the TabularModel.max_block_size attribute.

0
temp float

Temperature parameter for sampling.

1.0

Returns:

Type Description
RelationalData

A RelationalData object with the generated synthetic text data.
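
Example (a sketch; synth is assumed to be the RelationalData returned by TabularModel.generate):

synth = text_model.generate(
    data=synth,
    batch_size=128,
    max_text_len=0,  # 0 = use the maximum possible length
    temp=1.0,
)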

save

save(path: Path | str) -> None

Save the TextModel to a checkpoint at the given path.

Parameters:

Name Type Description Default
path Path | str

The path where to save the checkpoint.

required

load classmethod

load(path: Path | str) -> TextModel

Load the TextModel from the checkpoint at the given path.

Parameters:

Name Type Description Default
path Path | str

The path to the checkpoint to load.

required

Returns:

Type Description
TextModel

The loaded TextModel.

Size

Enumeration class representing different model sizes. Supported sizes are: SMALL, MEDIUM and LARGE.

TabularModelSize dataclass

Model size for TabularModel objects.

Parameters:

Name Type Description Default
n_layers int

Number of internal layers.

required
h int

Number of heads.

required
d int

Size of the internal dimension.

required

from_size classmethod

from_size(size: Size | str) -> TabularModelSize

Create an instance based on a given Size or its string representation.

Parameters:

Name Type Description Default
size Size | str

A Size or a str representing a Size.

required
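
Example (a sketch; the lowercase string form and the explicit layer/head/dimension values are assumptions):

size = TabularModelSize.from_size(Size.SMALL)      # or TabularModelSize.from_size("small")
custom = TabularModelSize(n_layers=4, h=8, d=256)  # explicit configuration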

TextModelSize dataclass

Model size for TextModel objects.

Parameters:

Name Type Description Default
n_layers int

Number of internal layers.

required
h int

Number of heads.

required
d int

Size of the internal dimension.

required

from_size classmethod

from_size(size: Size | str) -> TextModelSize

Create an instance based on a given Size or its string representation.

Parameters:

Name Type Description Default
size Size | str

A Size or a str representing a Size.

required

PredProb dataclass

The predicted probabilities for a single categorical column.

Attributes:

Name Type Description
prob Tensor

A torch.Tensor with the predicted probabilities, of shape (n_samples, n_categories).

categories list[str]

The categories to which the predicted probabilities correspond.

PredSample

The predicted samples. A list of RelationalData objects containing the predicted columns. It supports the list.append method and the + operator.

Attributes:

Name Type Description
ctx RelationalData

A RelationalData with the context used for the predictions.

n_samples int | None

The number of samples in each prediction.

schema Schema | None

The Schema of each prediction. It does not contain the context columns.

select

select(idx: Sequence[int]) -> PredSample

Select the predictions corresponding to the input indices.

Parameters:

Name Type Description Default
idx Sequence[int]

A Sequence of indices corresponding to the samples to select.

required

Returns:

Type Description
PredSample

A PredSample object with the selected predictions.
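
Example (a sketch; samples is assumed to be the PredSample returned by predict_sample):

best = samples.select([0, 3, 7])  # a new PredSample with only the selected predictions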

XgbModel

__init__

__init__(
    schema: Schema,
    ctx_cols: Sequence[str] = (),
    preprocessors: dict[
        str, ColumnPreproc | ArColumn | None
    ]
    | None = None,
    n_estimators: int | None = 1000,
    valid_frac: float | None = 0.0,
    **kwargs: Any,
) -> None

A generative model based on autoregressive XGBoost models. Can be used only with single-table data.

Parameters:

Name Type Description Default
schema Schema

The Schema of the data.

required
ctx_cols Sequence[str]

The columns to be used as context.

()
preprocessors dict[str, ColumnPreproc | ArColumn | None] | None

A dictionary containing preprocessing instructions for each column in the table. Each instruction can be a ColumnPreproc (a column preprocessor) or None; if None, the column is ignored. For columns without a provided preprocessor, the default preprocessor is instantiated based on the Column type defined in the Schema.

None
n_estimators int | None

Number of estimators for the XGBoost models.

1000
valid_frac float | None

Fraction of the training data to be used for validation.

0.0
**kwargs Any

Keyword arguments to be passed to the XGBoost models.

{}
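
Example (a sketch; the schema variable and the context column names are assumptions):

xgb = XgbModel(
    schema=schema,               # a single-table Schema
    ctx_cols=("age", "gender"),  # columns used as context
    n_estimators=1000,
    valid_frac=0.1,
)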

fit

fit(data: RelationalData) -> XgbModel

Fit the XgbModel to the given RelationalData.

Parameters:

Name Type Description Default
data RelationalData

The RelationalData to fit the XgbModel to.

required

Returns:

Type Description
XgbModel

The fitted instance of the XgbModel.

train

train(data: RelationalData) -> XgbModel

Train the XgbModel with the input RelationalData.

Parameters:

Name Type Description Default
data RelationalData

The training data, as a RelationalData object.

required

Returns:

Type Description
XgbModel

The trained instance of the XgbModel.

generate

generate(
    n_samples: int | None = None,
    ctx: DataFrame | None = None,
    batch_size: int = 0,
    temp: float = 1.0,
) -> RelationalData

Generate synthetic data.

Parameters:

Name Type Description Default
n_samples int | None

The desired number of samples. Must be given if and only if ctx is not provided.

None
ctx DataFrame | None

The columns of the context from which to start a conditional generation. If provided, n_samples should not be given. The content of the pd.DataFrame must match the context columns provided when defining the XgbModel.

None
batch_size int

Batch size used during generation. If 0, all data is generated in a single batch.

0
temp float

Temperature parameter for sampling.

1.0

Returns:

Type Description
RelationalData

A RelationalData object with a single synthetic table.
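
Example (a sketch; data is assumed to be a single-table RelationalData and ctx_df an illustrative context DataFrame):

xgb.fit(data)                                # fit the autoregressive XGBoost models
synth = xgb.generate(n_samples=1000)         # unconditional generation
synth = xgb.generate(ctx=ctx_df, temp=1.0)   # conditional generation from the declared ctx_cols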

save

save(path: Path | str) -> None

Save the XgbModel to a checkpoint at the given path.

load classmethod

load(path: Path | str) -> XgbModel

Load the XgbModel from the checkpoint at the given path.