Models

TabularModel

build classmethod

build(
    preproc: TabularPreproc,
    size: str | Size | TabularModelSize,
    block: str | None = None,
    dropout: float | None = 0.12,
) -> TabularModel

Tabular model to generate synthetic tabular relational data.

Parameters:

Name Type Description Default
preproc TabularPreproc

A TabularPreproc object.

required
size str | Size | TabularModelSize

The size configuration of the model. Can be a TabularModelSize object, a Size object, or the string representation of a Size.

required
block str | None

The block type. The possible values depend on whether the data is single-table or multi-table. For single-table data: 'free' (default), 'causal', or 'lstm'. For multi-table data: 'free' (default) or 'lstm'.

None
dropout float | None

The dropout probability.

0.12

Returns:

Type Description
TabularModel

A TabularModel instance.
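
The rules for the block parameter can be sketched as a small validation helper. This is only an illustration of the documented values and default: resolve_block and its multi_table flag are hypothetical names, and the library's own validation may behave differently.

```python
from __future__ import annotations

# Allowed block types, as listed in the description of the block parameter.
SINGLE_TABLE_BLOCKS = ("free", "causal", "lstm")
MULTI_TABLE_BLOCKS = ("free", "lstm")


def resolve_block(block: str | None, multi_table: bool) -> str:
    """Return the block type to use, applying the documented default.

    Hypothetical helper: it mirrors the documented rules only.
    """
    allowed = MULTI_TABLE_BLOCKS if multi_table else SINGLE_TABLE_BLOCKS
    if block is None:
        # 'free' is the documented default for both kinds of data.
        return "free"
    if block not in allowed:
        kind = "multi-table" if multi_table else "single-table"
        raise ValueError(
            f"block={block!r} is not valid for {kind} data; choose one of {allowed}"
        )
    return block
```

For example, resolve_block("causal", multi_table=True) raises a ValueError, because 'causal' is listed only for single-table data.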

generate

generate(
    n_samples: int | None = None,
    ctx: dict[str, DataFrame] | None = None,
    batch_size: int = 0,
    max_block_size: int = 0,
    temp: float = 1.0,
) -> RelationalData

Generate synthetic relational data.

Parameters:

Name Type Description Default
n_samples int | None

Desired number of samples in the root table. Must be given if and only if ctx is not provided.

None
ctx dict[str, DataFrame] | None

The columns of the context from which to start a conditional generation. If provided, n_samples should not be given. The content of each pd.DataFrame must match the context columns provided to the TabularPreproc:

1. If only a subset of the root table's columns is provided, the model will generate the foreign keys.
2. If a subset of columns is provided for each table, each pd.DataFrame must also contain the primary and foreign keys, and the generated synthetic data will have the same keys as the context.

Foreign keys that refer to lookup tables should be treated as feature columns, not as foreign keys.

None
batch_size int

Batch size used during generation. If 0, all data is generated in a single batch.

0
max_block_size int

Maximum length for each generated sample. Applies only to multi-table datasets and only when generating from the root table (case 1 in the description of ctx). If 0, no limit is enforced.

0
temp float

Temperature parameter for sampling.

1.0

Returns:

Type Description
RelationalData

A RelationalData object with the generated synthetic tabular data.
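
The interplay between n_samples, ctx and batch_size can be sketched as follows. The helper plan_batches is hypothetical, and ctx_rows stands in for the number of rows of the root-table context. The sketch also assumes that the last batch is simply smaller when the total is not a multiple of batch_size, which the description does not state.

```python
from __future__ import annotations


def plan_batches(
    n_samples: int | None,
    ctx_rows: int | None,
    batch_size: int = 0,
) -> list[int]:
    """Return the number of root-table rows generated in each batch.

    Hypothetical helper that mirrors the documented argument rules.
    """
    # Exactly one of n_samples and ctx must be given.
    if (n_samples is None) == (ctx_rows is None):
        raise ValueError("provide exactly one of n_samples and ctx")
    if batch_size < 0:
        raise ValueError("batch_size must be non-negative")
    total = n_samples if n_samples is not None else ctx_rows
    if batch_size == 0:
        # Documented behaviour: everything is generated in a single batch.
        return [total]
    full, rest = divmod(total, batch_size)
    # Assumption: a smaller final batch covers the remainder.
    return [batch_size] * full + ([rest] if rest else [])
```

With these assumptions, plan_batches(10, None, batch_size=4) yields three batches of 4, 4 and 2 rows, while passing both n_samples and a context, or neither, is rejected.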

save

save(path: Path | str) -> None

Save the TabularModel to a checkpoint at the given path.

load classmethod

load(path: Path | str) -> TabularModel

Load the TabularModel from the checkpoint at the given path.

TextModel

build classmethod

build(
    preproc: TextPreproc,
    size: str | Size | TextModelSize,
    block_size: int,
    dropout: float | None = 0.12,
) -> TextModel

Text model to generate synthetic text columns of a table which is part of a relational structure.

Parameters:

Name Type Description Default
preproc TextPreproc

A TextPreproc object.

required
size str | Size | TextModelSize

The size configuration of the model. Can be a Size object, a TextModelSize object, or the string representation of a Size.

required
block_size int

Maximum text sequence length that the model can process.

required
dropout float | None

The dropout probability.

0.12

Returns:

Type Description
TextModel

A TextModel instance.

build_from_pretrained classmethod

build_from_pretrained(
    preproc: TextPreproc,
    path: Path | str,
    block_size: int | None = None,
) -> TextModel

Build a text model from a pretrained model.

Parameters:

Name Type Description Default
preproc TextPreproc

A TextPreproc object.

required
path Path | str

The path to the checkpoint of the pre-trained model.

required
block_size int | None

Maximum text sequence length that the model can process during fine-tuning.

None

Returns:

Type Description
TextModel

A TextModel instance with the weights loaded from the pre-trained model.

generate

generate(
    data: RelationalData,
    batch_size: int = 0,
    max_text_len: int = 0,
    temp: float = 1.0,
) -> RelationalData

Generate text columns in the current table.

Parameters:

Name Type Description Default
data RelationalData

A RelationalData object containing synthetic data.

required
batch_size int

Batch size used during generation. If 0, generate all data in a single batch.

0
max_text_len int

Maximum length for the generated text. If 0, the maximum possible value is used, namely the value of the TabularModel.max_block_size attribute.

0
temp float

Temperature parameter for sampling.

1.0

Returns:

Type Description
RelationalData

A RelationalData object with the generated synthetic text data.
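
This page does not say how temp is applied internally. A common convention, assumed here, is to divide the logits by the temperature before the softmax: values below 1.0 sharpen the distribution and values above 1.0 flatten it. The functions below are illustrative and are not part of the documented API.

```python
import math
import random


def softmax_with_temperature(logits, temp=1.0):
    """Turn logits into probabilities after rescaling them by 1 / temp."""
    if temp <= 0:
        raise ValueError("temp must be positive")
    scaled = [x / temp for x in logits]
    # Subtract the maximum for numerical stability.
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temp=1.0, rng=None):
    """Sample one index according to the temperature-scaled probabilities."""
    rng = rng or random.Random()
    probs = softmax_with_temperature(logits, temp)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Under this convention, temp=1.0 (the default) leaves the model distribution unchanged.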

save

save(path: Path | str) -> None

Save the TextModel to a checkpoint at the given path.

load classmethod

load(path: Path | str) -> TextModel

Load the TextModel from the checkpoint at the given path.

Size

Enumeration class representing different model sizes. Supported sizes are: SMALL, MEDIUM and LARGE.

TabularModelSize dataclass

Model size for TabularModel objects.

Parameters:

Name Type Description Default
n_layers int

Number of internal layers.

required
h int

Number of heads.

required
d int

Size of the internal dimension.

required

from_size classmethod

from_size(size: Size | str) -> TabularModelSize

Create an instance based on a given Size or its string representation.

Parameters:

Name Type Description Default
size Size | str

A Size or a str representing a Size.

required
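
The way a Size, its string form, or an explicit size object could be normalized into one configuration can be sketched with stand-in classes. SketchSize and SketchModelSize are hypothetical names that only imitate the documented interface. The lowercase string values and the preset numbers are placeholders, not the library's actual values.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class SketchSize(Enum):
    """Stand-in for Size; the string values are an assumption."""

    SMALL = "small"
    MEDIUM = "medium"
    LARGE = "large"


@dataclass(frozen=True)
class SketchModelSize:
    """Stand-in for TabularModelSize, with the three documented fields."""

    n_layers: int
    h: int
    d: int

    @classmethod
    def from_size(cls, size: SketchSize | str) -> SketchModelSize:
        # Accept either the enum member or its string representation.
        if isinstance(size, str):
            size = SketchSize(size.lower())
        return cls(**_PRESETS[size])


# Placeholder presets for illustration only, NOT the library's real values.
_PRESETS = {
    SketchSize.SMALL: {"n_layers": 2, "h": 2, "d": 64},
    SketchSize.MEDIUM: {"n_layers": 4, "h": 4, "d": 128},
    SketchSize.LARGE: {"n_layers": 8, "h": 8, "d": 256},
}


def normalize_size(size: str | SketchSize | SketchModelSize) -> SketchModelSize:
    """Mimic how a size argument could be resolved to a concrete size."""
    if isinstance(size, SketchModelSize):
        return size  # an explicit configuration is used as is
    return SketchModelSize.from_size(size)
```

An explicit SketchModelSize(n_layers=3, h=4, d=96) passes through normalize_size unchanged, which is the natural way to request a configuration outside the three presets.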

TextModelSize dataclass

Model size for TextModel objects.

Parameters:

Name Type Description Default
n_layers int

Number of internal layers.

required
h int

Number of heads.

required
d int

Size of the internal dimension.

required

from_size classmethod

from_size(size: Size | str) -> TextModelSize

Create an instance based on a given Size or its string representation.

Parameters:

Name Type Description Default
size Size | str

A Size or a str representing a Size.

required

XgbModel

__init__

__init__(
    schema: Schema,
    overwrites: dict[str, ArColumn | None] | None = None,
    ctx_cols: Sequence[str] = (),
    n_estimators: int | None = 1000,
    valid_frac: float | None = 0.0,
    **kwargs: Any,
) -> None

A generative model based on autoregressive XGBoost models. Can be used only with single-table data.

Parameters:

Name Type Description Default
schema Schema

The Schema of the data.

required
overwrites dict[str, ArColumn | None] | None

Overwrites to the table preprocessor, given as a dictionary that maps column names to column preprocessors.

None
ctx_cols Sequence[str]

The columns to be used as context.

()
n_estimators int | None

Number of estimators for the XGBoost models.

1000
valid_frac float | None

Fraction of the training data to be used for validation.

0.0
**kwargs Any

Keyword arguments to be passed to the XGBoost models.

{}
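
The autoregressive idea behind this model can be illustrated with a toy generator: columns are produced one at a time, each conditioned on the columns already available, and context columns are simply given instead of sampled. In the sketch below, plain frequency tables take the place of the XGBoost models, and the class AutoregressiveSketch with all of its names is hypothetical. It shows the factorization only, not the library's implementation.

```python
from __future__ import annotations

import random
from collections import Counter, defaultdict


class AutoregressiveSketch:
    """Toy column-by-column generator (frequency tables, not XGBoost)."""

    def __init__(self, columns, ctx_cols=(), seed=0):
        # Context columns come first, so every other column can depend on them.
        self.columns = list(ctx_cols) + [c for c in columns if c not in ctx_cols]
        self.rng = random.Random(seed)
        self.tables = {}

    def fit(self, rows):
        # One conditional table per column: counts of values given the prefix.
        for i, col in enumerate(self.columns):
            table = defaultdict(Counter)
            for row in rows:
                prefix = tuple(row[c] for c in self.columns[:i])
                table[prefix][row[col]] += 1
            self.tables[col] = table
        return self

    def generate(self, n_samples=None, ctx=None):
        # Same rule as documented: exactly one of n_samples and ctx.
        if (n_samples is None) == (ctx is None):
            raise ValueError("provide exactly one of n_samples and ctx")
        rows = [dict(r) for r in ctx] if ctx is not None else [{} for _ in range(n_samples)]
        for row in rows:
            for i, col in enumerate(self.columns):
                if col in row:
                    continue  # provided by the context
                prefix = tuple(row[c] for c in self.columns[:i])
                counts = self.tables[col].get(prefix)
                if not counts:
                    # Unseen prefix: fall back to the marginal distribution.
                    counts = sum(self.tables[col].values(), Counter())
                values, weights = zip(*counts.items())
                row[col] = self.rng.choices(values, weights=weights, k=1)[0]
        return rows
```

Fitting it on a few rows with columns "city" and "lang", with "city" as context, and then calling generate(ctx=[{"city": "Paris"}]) completes each context row with a "lang" value seen together with that city.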

fit

fit(data: RelationalData) -> XgbModel

Fit the XgbModel to the given RelationalData.

Parameters:

Name Type Description Default
data RelationalData

The RelationalData to fit the XgbModel to.

required

Returns:

Type Description
XgbModel

The fitted instance of the XgbModel.

train

train(data: RelationalData) -> XgbModel

Train the XgbModel with the input RelationalData.

Parameters:

Name Type Description Default
data RelationalData

The training data, as a RelationalData object.

required

Returns:

Type Description
XgbModel

The trained instance of the XgbModel.

generate

generate(
    n_samples: int | None = None,
    ctx: DataFrame | None = None,
) -> RelationalData

Generate synthetic data.

Parameters:

Name Type Description Default
n_samples int | None

The desired number of samples. Must be given if and only if ctx is not provided.

None
ctx DataFrame | None

The columns of the context from which to start a conditional generation. If provided, n_samples should not be given. The content of the pd.DataFrame must match the context columns provided when defining the XgbModel.

None

Returns:

Type Description
RelationalData

A RelationalData object with a single synthetic table.
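
Before calling generate with a context, it can help to verify that the context columns agree with the ctx_cols given to the constructor. The check below is a hypothetical helper that works on plain column names. It reads "must match" as "exactly the same set of columns", which is an assumption.

```python
def check_ctx_columns(ctx_columns, ctx_cols):
    """Raise ValueError unless the context has exactly the declared columns."""
    missing = [c for c in ctx_cols if c not in ctx_columns]
    extra = [c for c in ctx_columns if c not in ctx_cols]
    if missing or extra:
        raise ValueError(
            f"context columns do not match ctx_cols: missing={missing}, extra={extra}"
        )
```

With a pandas DataFrame named ctx, the first argument would be list(ctx.columns).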

save

save(path: Path | str) -> None

Save the XgbModel to a checkpoint at the given path.

load classmethod

load(path: Path | str) -> XgbModel

Load the XgbModel from the checkpoint at the given path.