Models

TabularModel

build classmethod

build(
    preproc: TabularPreproc,
    size: str | Size | TabularModelSize,
    block: str | None = None,
    dropout: float | None = 0.12,
) -> TabularModel

Tabular model to generate synthetic tabular relational data.

Parameters:

Name Type Description Default
preproc TabularPreproc

A TabularPreproc object.

required
size str | Size | TabularModelSize

The size configuration of the model. Either a TabularModelSize object, a Size object, or the string representation of a Size.

required
block str | None

The block type. The possible values depend on whether the data is single table or multi table. For a single table, either 'free' (default), 'causal', or 'lstm'. For multi table data, either 'free' (default) or 'lstm'.

None
dropout float | None

The dropout probability.

0.12

Returns:

Type Description
TabularModel

A TabularModel instance.
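
Example (a minimal sketch; imports are omitted because the package path is not given in this reference, and the preproc variable, lowercase size string, and block value are illustrative assumptions):

# preproc is an already-built TabularPreproc for the relational data.
model = TabularModel.build(
    preproc=preproc,
    size="small",   # or Size.SMALL, or an explicit TabularModelSize
    block="free",   # default block type for both single- and multi-table data
    dropout=0.12,
)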

generate

generate(
    n_samples: int | None = None,
    ctx: dict[str, DataFrame] | None = None,
    batch_size: int = 0,
    max_block_size: int = 0,
    temp: float = 1.0,
) -> RelationalData

Generate synthetic relational data.

Parameters:

Name Type Description Default
n_samples int | None

Desired number of samples in the root table. Must be given if and only if ctx is not provided.

None
ctx dict[str, DataFrame] | None

The context from which to start a conditional generation. If provided, n_samples should not be given. The content of the pd.DataFrames must match the context columns provided to the TabularPreproc: 1. If only a subset of the root table's columns is provided, the model generates the foreign keys. 2. If a subset of columns is provided for each table, each pd.DataFrame must also contain the primary and foreign keys, and the generated synthetic data will keep the keys provided in the context. Foreign keys referring to lookup tables should be treated as feature columns, not as foreign keys.

None
batch_size int

Batch size used during generation. If 0, all data is generated in a single batch.

0
max_block_size int

Maximum length for each generated sample. Active only for multi-table datasets and for generation from the root table (denoted above as 1.). If 0, no limit is enforced.

0
temp float

Temperature parameter for sampling.

1.0

Returns:

Type Description
RelationalData

A RelationalData object with the generated synthetic tabular data.
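
Example (a sketch; the table name "customers" and the ctx_df DataFrame are illustrative assumptions):

# Unconditional generation: request a number of rows for the root table.
synth = model.generate(n_samples=1000, batch_size=256)

# Conditional generation: start from a context and omit n_samples.
# ctx_df holds a subset of the root table's columns.
synth = model.generate(ctx={"customers": ctx_df}, batch_size=256, temp=1.0)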

predict_proba

predict_proba(
    ctx: dict[str, DataFrame], batch_size: int = 0
) -> PredProb

Predict probabilities for each category of a single categorical column. In order to use this function, the context must contain all columns except for a single categorical column in the root table.

Parameters:

Name Type Description Default
ctx dict[str, DataFrame]

The input context. Must contain all columns except for a single categorical column in the root table.

required
batch_size int

Batch size used during prediction. If 0, all predictions are performed in a single batch.

0

Returns:

Type Description
PredProb

A PredProb object, containing the predicted probabilities and the corresponding categories.
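
Example (a sketch; ctx is assumed to contain all root-table columns except one categorical column):

pred = model.predict_proba(ctx=ctx, batch_size=256)
probabilities = pred.prob     # torch.Tensor of shape (n_samples, n_categories)
categories = pred.categories  # list[str] aligned with the second dimension of prob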

predict_sample

predict_sample(
    ctx: dict[str, DataFrame],
    n: int,
    batch_size: int = 0,
    temp: float = 1.0,
    rng: Generator | int | None = None,
) -> PredSample

Make n prediction samples from a given context.

Parameters:

Name Type Description Default
ctx dict[str, DataFrame]

The context from which to compute the n prediction samples.

required
n int

The number of prediction samples.

required
batch_size int

Batch size used during prediction. If 0, all predictions are performed in a single batch.

0
temp float

Temperature parameter for sampling.

1.0
rng Generator | int | None

A torch.Generator or an integer seed to control the randomness in the sample. If None, a random seed is generated.

None

Returns:

Type Description
PredSample

A PredSample object, namely a list of the predicted RelationalData (containing only the predicted columns). The common context can be retrieved from the PredSample.ctx attribute.
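
Example (a sketch; ctx is an illustrative context dictionary, and list-style indexing of the result is assumed since PredSample is documented as a list of RelationalData):

# Draw 10 prediction samples from the same context, with a fixed seed.
samples = model.predict_sample(ctx=ctx, n=10, temp=1.0, rng=42)
first = samples[0]        # a RelationalData with only the predicted columns
shared_ctx = samples.ctx  # the common context, as RelationalData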

save

save(path: Path | str) -> None

Save the TabularModel to a checkpoint at the given path.

Parameters:

Name Type Description Default
path Path | str

The path where to save the checkpoint.

required

load classmethod

load(path: Path | str) -> TabularModel

Load the TabularModel from the checkpoint at the given path.

Parameters:

Name Type Description Default
path Path | str

The path to the checkpoint to load.

required

Returns:

Type Description
TabularModel

The loaded TabularModel.
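
Example (the checkpoint path is illustrative):

model.save("checkpoints/tabular_model.pt")
model = TabularModel.load("checkpoints/tabular_model.pt")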

TextModel

build classmethod

build(
    preproc: TextPreproc,
    size: str | Size | TextModelSize,
    block_size: int,
    dropout: float | None = 0.12,
) -> TextModel

Text model to generate synthetic text columns of a table that is part of a relational structure.

Parameters:

Name Type Description Default
preproc TextPreproc

A TextPreproc object.

required
size str | Size | TextModelSize

The size configuration of the model. Either a TextModelSize object, a Size object, or the string representation of a Size.

required
block_size int

Maximum text sequence length that the model can process.

required
dropout float | None

The dropout probability.

0.12

Returns:

Type Description
TextModel

A TextModel instance.
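
Example (a sketch; text_preproc, the size string, and the block_size value are assumptions):

text_model = TextModel.build(
    preproc=text_preproc,  # an already-built TextPreproc
    size="small",
    block_size=512,        # maximum text sequence length
    dropout=0.12,
)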

build_from_pretrained classmethod

build_from_pretrained(
    preproc: TextPreproc,
    path: Path | str,
    block_size: int | None = None,
) -> TextModel

Build a text model from a pretrained model.

Parameters:

Name Type Description Default
preproc TextPreproc

A TextPreproc object.

required
path Path | str

The path to the checkpoint of the pre-trained model.

required
block_size int | None

Maximum text sequence length that the model can process during fine-tuning.

None

Returns:

Type Description
TextModel

A TextModel instance with the weights loaded from the pre-trained model.
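
Example (a sketch; the checkpoint path is illustrative):

text_model = TextModel.build_from_pretrained(
    preproc=text_preproc,
    path="checkpoints/pretrained_text.pt",
    block_size=512,
)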

generate

generate(
    data: RelationalData,
    batch_size: int = 0,
    max_text_len: int = 0,
    temp: float = 1.0,
) -> RelationalData

Generate text columns in the current table.

Parameters:

Name Type Description Default
data RelationalData

A RelationalData object containing synthetic data.

required
batch_size int

Batch size used during generation. If 0, generate all data in a single batch.

0
max_text_len int

Maximum length for the generated text. If 0, the maximum possible value is used, namely the value of the TabularModel.max_block_size attribute.

0
temp float

Temperature parameter for sampling.

1.0

Returns:

Type Description
RelationalData

A RelationalData object with the generated synthetic text data.
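
Example (a sketch; synth is assumed to be the RelationalData returned by TabularModel.generate):

synth = text_model.generate(
    data=synth,
    batch_size=128,
    max_text_len=0,  # 0 = use the maximum possible length
    temp=1.0,
)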

save

save(path: Path | str) -> None

Save the TextModel to a checkpoint at the given path.

Parameters:

Name Type Description Default
path Path | str

The path where to save the checkpoint.

required

load classmethod

load(path: Path | str) -> TextModel

Load the TextModel from the checkpoint at the given path.

Parameters:

Name Type Description Default
path Path | str

The path to the checkpoint to load.

required

Returns:

Type Description
TextModel

The loaded TextModel.

Size

Enumeration class representing different model sizes. Supported sizes are: SMALL, MEDIUM and LARGE.

TabularModelSize dataclass

Model size for TabularModel objects.

Parameters:

Name Type Description Default
n_layers int

Number of internal layers.

required
h int

Number of heads.

required
d int

Size of the internal dimension.

required

from_size classmethod

from_size(size: Size | str) -> TabularModelSize

Create an instance based on a given Size or its string representation.

Parameters:

Name Type Description Default
size Size | str

A Size or a str representing a Size.

required
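
Example (a sketch; the lowercase string form and the explicit layer/head/dimension values are assumptions):

size = TabularModelSize.from_size(Size.SMALL)      # or TabularModelSize.from_size("small")
custom = TabularModelSize(n_layers=4, h=8, d=256)  # explicit configuration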

TextModelSize dataclass

Model size for TextModel objects.

Parameters:

Name Type Description Default
n_layers int

Number of internal layers.

required
h int

Number of heads.

required
d int

Size of the internal dimension.

required

from_size classmethod

from_size(size: Size | str) -> TextModelSize

Create an instance based on a given Size or its string representation.

Parameters:

Name Type Description Default
size Size | str

A Size or a str representing a Size.

required

PredProb dataclass

The predicted probabilities for a single categorical column.

Attributes:

Name Type Description
prob Tensor

A torch.Tensor with the predicted probabilities, of shape (n_samples, n_categories).

categories list[str]

The categories to which the predicted probabilities correspond.

PredSample

The predicted samples. A list of RelationalData objects containing the predicted columns. It supports the list.append method and the + operator.

Attributes:

Name Type Description
ctx RelationalData

A RelationalData with the context used for the predictions.

n_samples int | None

The number of samples in each prediction.

schema Schema | None

The Schema of each prediction. It does not contain the context columns.

select

select(idx: Sequence[int]) -> PredSample

Select the predictions corresponding to the input indices.

Parameters:

Name Type Description Default
idx Sequence[int]

A Sequence of indices corresponding to the samples to select.

required

Returns:

Type Description
PredSample

A PredSample object with the selected predictions.
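
Example (a sketch; samples is assumed to be the PredSample returned by predict_sample):

best = samples.select([0, 3, 7])  # a new PredSample with only the selected predictions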

XgbModel

__init__

__init__(
    schema: Schema,
    ctx_cols: Sequence[str] = (),
    preprocessors: dict[
        str, ColumnPreproc | ArColumn | None
    ]
    | None = None,
    n_estimators: int | None = 1000,
    valid_frac: float | None = 0.0,
    **kwargs: Any,
) -> None

A generative model based on autoregressive XGBoost models. Can be used only with single-table data.

Parameters:

Name Type Description Default
schema Schema

The Schema of the data.

required
ctx_cols Sequence[str]

The columns to be used as context.

()
preprocessors dict[str, ColumnPreproc | ArColumn | None] | None

A dictionary containing preprocessing instructions for each column in the table. Each instruction can be a ColumnPreproc (a column preprocessor) or None; if None, the column is ignored. For columns without a provided preprocessor, the default preprocessor is instantiated based on the Column type defined in the Schema.

None
n_estimators int | None

Number of estimators for the XGBoost models.

1000
valid_frac float | None

Fraction of the training data to be used for validation.

0.0
**kwargs Any

Keyword arguments to be passed to the XGBoost models.

{}
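
Example (a sketch; the schema variable and the context column names are assumptions):

xgb = XgbModel(
    schema=schema,               # a single-table Schema
    ctx_cols=("age", "gender"),  # columns used as context
    n_estimators=1000,
    valid_frac=0.1,
)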

fit

fit(data: RelationalData) -> XgbModel

Fit the XgbModel to the given RelationalData.

Parameters:

Name Type Description Default
data RelationalData

The RelationalData to fit the XgbModel to.

required

Returns:

Type Description
XgbModel

The fitted instance of the XgbModel.

train

train(data: RelationalData) -> XgbModel

Train the XgbModel with the input RelationalData.

Parameters:

Name Type Description Default
data RelationalData

The training data, as a RelationalData object.

required

Returns:

Type Description
XgbModel

The trained instance of the XgbModel.

generate

generate(
    n_samples: int | None = None,
    ctx: DataFrame | None = None,
    batch_size: int = 0,
    temp: float = 1.0,
) -> RelationalData

Generate synthetic data.

Parameters:

Name Type Description Default
n_samples int | None

The desired number of samples. Must be given if and only if ctx is not provided.

None
ctx DataFrame | None

The columns of the context from which to start a conditional generation. If provided, n_samples should not be given. The content of the pd.DataFrame must match the context columns provided when defining the XgbModel.

None
batch_size int

Batch size used during generation. If 0, all data is generated in a single batch.

0
temp float

Temperature parameter for sampling.

1.0

Returns:

Type Description
RelationalData

A RelationalData object with a single synthetic table.
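
Example (a sketch; data is assumed to be a single-table RelationalData and ctx_df an illustrative context DataFrame):

xgb.fit(data)                                # fit the autoregressive XGBoost models
synth = xgb.generate(n_samples=1000)         # unconditional generation
synth = xgb.generate(ctx=ctx_df, temp=1.0)   # conditional generation from the declared ctx_cols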

save

save(path: Path | str) -> None

Save the XgbModel to a checkpoint at the given path.

load classmethod

load(path: Path | str) -> XgbModel

Load the XgbModel from the checkpoint at the given path.