Preprocessors

TabularPreproc

from_schema classmethod

from_schema(
    schema: Schema,
    ctx_cols: dict[str, Sequence[str]] | None = None,
    preprocessors: dict[
        str, dict[str, ColumnPreproc | ArColumn | None]
    ]
    | None = None,
) -> TabularPreproc

Build a preprocessor for tabular data from the Schema.

Parameters:

Name Type Description Default
schema Schema

A Schema object.

required
ctx_cols dict[str, Sequence[str]] | None

A dictionary with the columns to be used as context. It may contain either (1) only the root table as key, with a subset of its columns as value, or (2) all the tables as keys, with a subset of each table's columns as values.

None
preprocessors dict[str, dict[str, ColumnPreproc | ArColumn | None]] | None

A dictionary containing preprocessing instructions for each column in the schema. Keys are table names; values are dictionaries with column names as keys and preprocessing instructions as values. A preprocessing instruction can be a ColumnPreproc instance, a column preprocessor (ArColumn), or None. If None, the column is ignored. For columns without an entry, a default preprocessor is instantiated based on the Column type defined in the Schema.

None

Returns:

Type Description
TabularPreproc

A TabularPreproc object.
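
The shapes accepted by ctx_cols and preprocessors can be sketched with plain dictionaries. The snippet below is a minimal illustration that does not import the library: the table names (host, listings), the column names, and the helpers ctx_cols_shape and ignored_columns are all hypothetical, and the checks only mirror the cases described in the parameter descriptions above.

```python
from __future__ import annotations

from collections.abc import Sequence

# Hypothetical schema layout: one root table and one child table.
TABLES = {
    "host": ["host_id", "since", "city"],
    "listings": ["host_id", "price", "room_type"],
}
ROOT = "host"


def ctx_cols_shape(ctx_cols: dict[str, Sequence[str]] | None) -> str:
    """Classify a ctx_cols mapping as one of the documented shapes."""
    if ctx_cols is None:
        return "none"
    for table, cols in ctx_cols.items():
        unknown = set(cols) - set(TABLES.get(table, []))
        if table not in TABLES or unknown:
            raise ValueError(f"unknown table or columns: {table!r} {sorted(unknown)}")
    if set(ctx_cols) == {ROOT}:
        return "root-only"
    if set(ctx_cols) == set(TABLES):
        return "all-tables"
    raise ValueError("ctx_cols must name only the root table or every table")


def ignored_columns(
    preprocessors: dict[str, dict[str, object | None]],
) -> dict[str, list[str]]:
    """List, per table, the columns explicitly mapped to None (ignored)."""
    return {
        table: [col for col, instr in cols.items() if instr is None]
        for table, cols in preprocessors.items()
    }


# Case 1: only the root table, with a subset of its columns.
root_only = {"host": ["city"]}
# Case 2: every table, each with a subset of its columns.
all_tables = {"host": ["city"], "listings": ["room_type"]}
# Mapping a column to None marks it as ignored; columns left out of the
# mapping would fall back to the default preprocessor.
preprocessors = {"listings": {"price": None}}
```

Dictionaries of this shape are what from_schema expects for ctx_cols and preprocessors; the validation itself is only a reading aid and says nothing about how the library checks its inputs.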

fit

Fit the preprocessor to the given RelationalData.

Parameters:

Name Type Description Default
data RelationalData

The RelationalData to fit the preprocessor to.

required

Returns:

Type Description
TabularPreproc

The fitted TabularPreproc object.

select_ctx

select_ctx(
    data: RelationalData, idx: Sequence[int] | None = None
) -> RelationalData

Select the context from the input data.

Parameters:

Name Type Description Default
data RelationalData

The RelationalData from which to extract the context.

required
idx Sequence[int] | None

The indices of the root table to select. May be repeated. If None, all indices will be taken once and the keys will be kept. Otherwise, the keys will be reset.

None

Returns:

Type Description
RelationalData

A RelationalData with the selected context.
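
The effect of idx on the root-table keys can be illustrated on a toy table. This is a sketch of the documented behaviour only, not the library's implementation: the root table is modelled as a list of (key, row) pairs, select_root is a hypothetical helper, and the sketch assumes that reset keys are consecutive integers starting at 0.

```python
from __future__ import annotations

from collections.abc import Sequence


def select_root(
    root: list[tuple[int, dict]], idx: Sequence[int] | None = None
) -> list[tuple[int, dict]]:
    """Select root rows by position, mimicking the documented key handling."""
    if idx is None:
        # All rows are taken once and the original keys are kept.
        return list(root)
    # Positions may repeat; keys are reset (assumed here to be 0, 1, 2, ...).
    return [(new_key, root[i][1]) for new_key, i in enumerate(idx)]


# Invented toy data: three root rows with arbitrary original keys.
root = [
    (101, {"city": "Rome"}),
    (205, {"city": "Turin"}),
    (317, {"city": "Milan"}),
]

kept = select_root(root)
repeated = select_root(root, idx=[2, 2, 0])
```

Resetting the keys is what makes repeated indices workable: two copies of the same root row would otherwise share a key.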

sample_ctx

sample_ctx(
    data: RelationalData,
    n_samples: int | None = None,
    rng: Generator | int | None = None,
) -> RelationalData

Sample the context from the input data.

Parameters:

Name Type Description Default
data RelationalData

The RelationalData from which to extract the context.

required
n_samples int | None

The number of context samples. If None, the number of samples will be equal to the number of samples in the root table.

None
rng Generator | int | None

A np.random.Generator or an integer seed to control the randomness during sampling. If None, a random seed is generated.

None

Returns:

Type Description
RelationalData

A RelationalData with the sampled context.
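
The three accepted forms of rng match what np.random.default_rng takes: a Generator, an integer seed, or None. The sketch below shows how those forms and the n_samples default could be normalized. sample_root_indices is a hypothetical helper, and drawing root indices uniformly with replacement is an assumption, since the sampling scheme is not specified here.

```python
from __future__ import annotations

import numpy as np


def sample_root_indices(
    n_root: int,
    n_samples: int | None = None,
    rng: np.random.Generator | int | None = None,
) -> np.ndarray:
    """Draw root-table positions; a stand-in for the sampling step only."""
    # default_rng returns a Generator unchanged, seeds a new one from an int,
    # and draws fresh entropy when given None.
    gen = np.random.default_rng(rng)
    # With n_samples=None, draw as many samples as there are root rows.
    n = n_root if n_samples is None else n_samples
    # Assumption: uniform sampling with replacement.
    return gen.integers(0, n_root, size=n)


first = sample_root_indices(5, n_samples=8, rng=0)
second = sample_root_indices(5, n_samples=8, rng=0)
default_size = sample_root_indices(5)
```

Passing the same integer seed twice yields the same draw, which is the usual reason to set rng when reproducible context samples are needed.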

TextPreproc

from_schema_table classmethod

from_schema_table(
    schema: Schema, table: str
) -> TextPreproc

Build a preprocessor for the text columns of a table from the Schema.

Parameters:

Name Type Description Default
schema Schema

A Schema object.

required
table str

Name of the target table in the schema that contains text columns.

required

Returns:

Type Description
TextPreproc

A TextPreproc object.

from_tabular classmethod

from_tabular(
    preproc: TabularPreproc[_AP], table: str
) -> TextPreproc

Build a preprocessor for the text columns of a table from the TabularPreproc used for the tabular data.

Parameters:

Name Type Description Default
preproc TabularPreproc[_AP]

A TabularPreproc object used for the tabular part of the data.

required
table str

Name of the target table in the schema that contains text columns.

required

Returns:

Type Description
TextPreproc

A TextPreproc object.

fit

Fit the preprocessor to the given RelationalData.

Parameters:

Name Type Description Default
data RelationalData

The RelationalData to fit the preprocessor to.

required

Returns:

Type Description
TextPreproc

The fitted TextPreproc object.

ColumnPreproc dataclass

Preprocessing instructions for a column.

Parameters:

Name Type Description Default
special_values Sequence | None

A sequence of special values to handle during preprocessing.

None
impute_nan bool | None

A flag indicating whether to impute NaN values during preprocessing. If True, NaN values will not be sampled during the generation of synthetic data.

None
non_sample_values Sequence | None

A sequence of values that should not be sampled during the generation of synthetic data.

None
protection Protection | bool | None

A Protection object or a boolean flag indicating whether to apply protection to the column. If a boolean is given, the default protection is applied; otherwise, the Protection object configures the protection.

None
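
For illustration, the fields above can be mirrored in a stand-in dataclass. ColumnPreprocSketch below is hypothetical and is not the library class: it copies only the documented field names and defaults, and it types protection loosely as object because the Protection class is not described here. The table name, column names and values are invented.

```python
from __future__ import annotations

from collections.abc import Sequence
from dataclasses import dataclass


@dataclass
class ColumnPreprocSketch:
    """Stand-in mirroring the documented ColumnPreproc fields."""

    special_values: Sequence | None = None
    impute_nan: bool | None = None
    non_sample_values: Sequence | None = None
    protection: object | bool | None = None  # Protection object or bool


# Treat -1.0 as a special value and impute NaN, so NaN is never sampled.
price = ColumnPreprocSketch(special_values=[-1.0], impute_nan=True)

# Never generate "Shared room", and apply the default protection.
room_type = ColumnPreprocSketch(non_sample_values=["Shared room"], protection=True)

# Per-table, per-column instructions; None marks a column to be ignored.
preprocessors = {
    "listings": {"price": price, "room_type": room_type, "notes": None},
}
```

A mapping of this shape corresponds to the preprocessors argument of TabularPreproc.from_schema, with the real ColumnPreproc in place of the stand-in.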