Preprocessors

TabularPreproc

from_schema classmethod

from_schema(
    schema: Schema,
    ctx_cols: dict[str, Sequence[str]] | None = None,
    preprocessors: dict[
        str, dict[str, ColumnPreproc | ArColumn | None]
    ]
    | None = None,
) -> TabularPreproc

Build a preprocessor for tabular data from the Schema.

Parameters:

Name Type Description Default
schema Schema

A Schema object.

required
ctx_cols dict[str, Sequence[str]] | None

A dictionary with the columns to be used as context. It may contain either (1) only the root table as key, with a subset of its columns as value, or (2) all the tables as keys, with a subset of each table's columns as values.

None
preprocessors dict[str, dict[str, ColumnPreproc | ArColumn | None]] | None

A dictionary containing preprocessing instructions for each column in the schema. Keys are table names; values are dictionaries with column names as keys and preprocessing instructions as values. Preprocessing instructions can be instances of ColumnPreproc, a column preprocessor (ArColumn in the signature), or None. If None, the default preprocessor is instantiated based on the Column type defined in the Schema.

None

Returns:

Type Description
TabularPreproc

A TabularPreproc object.
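
The shapes of the `ctx_cols` and `preprocessors` arguments described above can be sketched as follows. This is a minimal illustration, not the library's own code: the `ColumnPreproc` class below is a local stand-in that only mirrors the documented field names, and the table and column names (`customers`, `orders`, `age`, and so on) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ColumnPreproc:
    """Local stand-in mirroring the documented fields (not the library class)."""
    special_values: Optional[Sequence] = None
    impute_nan: Optional[bool] = None
    non_sample_values: Optional[Sequence] = None
    protection: Optional[bool] = None


# Hypothetical schema: "customers" is the root table, "orders" is a child table.
# Form 1: only the root table as key, with a subset of its columns as context.
ctx_cols_root_only = {"customers": ["country", "segment"]}

# Form 2: every table as a key, each with a subset of its columns.
ctx_cols_all_tables = {
    "customers": ["country", "segment"],
    "orders": ["channel"],
}

# Table name -> column name -> preprocessing instructions.
# None asks for the default preprocessor for that column's type.
preprocessors = {
    "customers": {
        "age": ColumnPreproc(impute_nan=True),
        "country": None,
    },
    "orders": {
        "amount": ColumnPreproc(special_values=[0.0]),
    },
}

# With the real classes, these would be passed along the lines of:
# TabularPreproc.from_schema(schema=schema, ctx_cols=ctx_cols_root_only,
#                            preprocessors=preprocessors)
```

Both arguments are optional, so either dictionary can be left out and only the columns that need non-default handling have to be listed.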

fit

Fit the preprocessor to the given RelationalData.

Parameters:

Name Type Description Default
data RelationalData

The RelationalData to fit the preprocessor to.

required

Returns:

Type Description
TabularPreproc

The fitted TabularPreproc object.
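
Because `fit` returns the fitted object, building and fitting can be chained in a single expression. The toy sketch below illustrates only that pattern. `ToyPreproc`, its dictionary-based schema and data, and the value-collecting logic are assumptions made for illustration and say nothing about how TabularPreproc fits internally.

```python
class ToyPreproc:
    """Toy stand-in showing the build-then-fit pattern (not the library class)."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.categories = {}  # filled in by fit()

    @classmethod
    def from_schema(cls, schema):
        # Here the hypothetical "schema" is just a mapping: column name -> type label.
        return cls(schema.keys())

    def fit(self, data):
        # Record the distinct values seen in each column, then return self
        # so that the call can be chained.
        for col in self.columns:
            self.categories[col] = sorted(set(data[col]))
        return self


schema = {"country": "categorical", "segment": "categorical"}
data = {"country": ["IT", "FR", "IT"], "segment": ["b2b", "b2c", "b2b"]}

# Build and fit in one expression; the result is the fitted object itself.
preproc = ToyPreproc.from_schema(schema).fit(data)
```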

TextPreproc

from_schema_table classmethod

from_schema_table(
    schema: Schema, table: str
) -> TextPreproc

Build a preprocessor for the text columns of a table from the Schema.

Parameters:

Name Type Description Default
schema Schema

A Schema object.

required
table str

Name of the target table in the schema that contains text columns.

required

Returns:

Type Description
TextPreproc

A TextPreproc object.

from_tabular classmethod

from_tabular(
    preproc: TabularPreproc[_AP], table: str
) -> TextPreproc

Build a preprocessor for the text columns of a table from the TabularPreproc used for the tabular data.

Parameters:

Name Type Description Default
preproc TabularPreproc[_AP]

A TabularPreproc object used for the tabular part of the data.

required
table str

Name of the target table in the schema that contains text columns.

required

Returns:

Type Description
TextPreproc

A TextPreproc object.

fit

Fit the preprocessor to the given RelationalData.

Parameters:

Name Type Description Default
data RelationalData

The RelationalData to fit the preprocessor to.

required

Returns:

Type Description
TextPreproc

The fitted TextPreproc object.

ColumnPreproc dataclass

Preprocessing instructions for a column.

Parameters:

Name Type Description Default
special_values Sequence | None

A sequence of special values to handle during preprocessing.

None
impute_nan bool | None

A flag indicating whether to impute NaN values during preprocessing. If True, NaN values will not be sampled during the generation of synthetic data.

None
non_sample_values Sequence | None

A sequence of values that should not be sampled during the generation of synthetic data.

None
protection Protection | bool | None

A Protection object or a boolean flag indicating whether to apply protection to the column. With a boolean, the default protection is applied; with a Protection object, the object configures the protection.

None
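
As a rough illustration of how the four fields above combine, the sketch below builds a few per-column instructions. The dataclass is a local stand-in that copies only the documented field names and defaults. The Protection object is left out because its fields are not described here, and the column meanings and values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ColumnPreproc:
    """Local stand-in with the four documented fields (not the library class)."""
    special_values: Optional[Sequence] = None
    impute_nan: Optional[bool] = None
    non_sample_values: Optional[Sequence] = None
    # The documented type also accepts a Protection object, omitted here.
    protection: Optional[bool] = None


# All four fields default to None.
defaults = ColumnPreproc()

# Hypothetical numeric column where -1 is a sentinel value and NaN is imputed.
income = ColumnPreproc(special_values=[-1], impute_nan=True)

# Hypothetical column with a value that should not be sampled in synthetic data,
# and with the default protection requested through the boolean flag.
city = ColumnPreproc(non_sample_values=["Atlantis"], protection=True)
```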