
Data preprocessing

Data preprocessing transforms the data columns into a form suitable for model training. It can also include optional steps that reduce the risk of privacy breaches and help anonymize the data.

Preprocessing is performed with a TabularPreproc object. To instantiate the default preprocessor, users can pass a Schema object to the TabularPreproc.from_schema() method. After instantiation, the TabularPreproc object must be fitted on a RelationalData object.

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularPreproc

data = RelationalData(data=..., schema=...)
preproc = TabularPreproc.from_schema(schema=data.schema)
preproc.fit(data=data)

Users also have the option to specify custom preprocessing for each column. This can be achieved by passing the preprocessors argument to the TabularPreproc.from_schema() method. The preprocessors parameter is a dictionary whose keys are table names and whose values are dictionaries mapping column names to one of the following:

  1. A ColumnPreproc object, which defines custom behavior for that column during the preprocessing step;
  2. None, which tells the preprocessor to ignore that column;
  3. A custom column preprocessor instance. This option is designed for advanced users seeking access to lower-level functionality.
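
As a minimal sketch, the first two options can be combined in a single preprocessors dictionary (the "listings" table and the "price" and "name" columns are assumed here, following the Airbnb-style examples used later on this page):

```python
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import ColumnPreproc, TabularPreproc

data = RelationalData(data=..., schema=...)
preproc = TabularPreproc.from_schema(
    schema=data.schema,
    preprocessors={
        "listings": {
            "price": ColumnPreproc(impute_nan=True),  # option 1: custom behavior
            "name": None,  # option 2: ignore this column
        },
    },
)
preproc.fit(data=data)
```

Columns not listed in the dictionary keep the default preprocessing.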

The preprocessing of text data is managed by TextPreproc objects, one for each table containing text. A TextPreproc object must also preprocess the tabular part of the data, which is used to condition the text during training and generation. In most cases, the text columns are generated in addition to the rest of the tabular data, so a TabularPreproc object is already available. Each TextPreproc object can then be built from the latter with the TextPreproc.from_tabular() method, also providing the name of the table to consider.

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularPreproc, TextPreproc

data = RelationalData(data=..., schema=...)
preproc = TabularPreproc.from_schema(schema=data.schema)
preproc.fit(data=data)
preproc_text = TextPreproc.from_tabular(preproc=preproc, table="listings")
preproc_text.fit(data=data)

If no TabularPreproc object is available, the text preprocessor can also be built from scratch with the TextPreproc.from_schema_table() method, which requires the Schema and the name of the table containing the text columns.

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TextPreproc

data = RelationalData(data=..., schema=...)
preproc_text = TextPreproc.from_schema_table(schema=data.schema, table="listings")
preproc_text.fit(data=data)

To ensure consistency, the first method is recommended when both tabular and text data need to be generated.

Note that custom preprocessing of text columns is not supported.

ColumnPreproc (advanced user)

A ColumnPreproc object offers four optional parameters designed to customize the preprocessing of a column:

  1. special_values: A set of values that will be treated separately from the other values of the column, for example in a column with mixed-type values.
  2. impute_nan: Whether to force the model to avoid generating missing values in the synthetic data.
  3. non_sample_values: A set of values that will not be generated in the synthetic data.
  4. protection: Extra protection from potential privacy leaks coming from rare or extremal values present in the original column data.
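
The four parameters can also be combined on a single column. The following sketch uses the "listings" table and "price" column from the examples below; the specific values are illustrative:

```python
from aindo.rdml.synth import ColumnPreproc, TabularPreproc

preproc = TabularPreproc.from_schema(
    schema=...,
    preprocessors={
        "listings": {
            "price": ColumnPreproc(
                special_values=["missing"],  # treat "missing" separately
                impute_nan=True,             # never generate NaN values
                non_sample_values=[0],       # never generate the value 0
                protection=True,             # default privacy protection
            ),
        },
    },
)
```
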

In the next subsections, we describe in detail the effect of these parameters.

Special values

The parameter special_values takes a list of values that are considered special or unique within the dataset, such as special characters occurring in a numeric column or outliers within a distribution. For instance, in the Airbnb dataset, let us assume that the numerical column price can sometimes assume the non-numerical value "missing". In such a case, we might denote this value as special:

from aindo.rdml.synth import ColumnPreproc, TabularPreproc

preproc = TabularPreproc.from_schema(
    schema=...,
    preprocessors={
        "listings": {
            "price": ColumnPreproc(special_values=["missing"]),
        },
    },
)

Imputation of missing values

The parameter impute_nan is a boolean flag that determines whether NaN values within the column should be sampled. When set to True, NaN values are imputed, ensuring that the synthetic data does not include any NaN values. For instance, to avoid sampling NaN values in the price column:

from aindo.rdml.synth import ColumnPreproc, TabularPreproc

preproc = TabularPreproc.from_schema(
    schema=...,
    preprocessors={
        "listings": {
            "price": ColumnPreproc(impute_nan=True),
        },
    },
)

Avoid sampling certain values

The parameter non_sample_values allows the user to set a list of values that will not be sampled during generation, e.g. "Manhattan" and "Brooklyn" in the neighbourhood_group column:

from aindo.rdml.synth import ColumnPreproc, TabularPreproc

preproc = TabularPreproc.from_schema(
    schema=...,
    preprocessors={
        "listings": {
            "neighbourhood_group": ColumnPreproc(non_sample_values=["Manhattan", "Brooklyn"]),
        },
    },
)

In place of these values, some other plausible values of the same column will be sampled when generating synthetic data.

Protection of rare values

The aindo.rdml library provides a range of options to add extra privacy protection for extremal or rare values that might be present in the columns. Although the model does not learn from individual data subjects, it does learn rare categories and the ranges of numerical values, which in some cases may disclose sensitive information from the original dataset.

Consider for example a dataset with a range of information about the employees of a company, including their salaries. Let us say the CEO has the highest salary in the dataset.

Employee ID | Name          | Age | Role    | Salary
001         | Alice Johnson | 60  | CEO     | $100,000
002         | John Smith    | 32  | HR      | $55,000
003         | Emily Davis   | 35  | Finance | $65,000

A model trained on this dataset will learn the range of values that the Salary column can take. When generating synthetic data, the model may (rarely) generate employees with salaries as high as the CEO's. Such an extremal value in the synthetic dataset in fact reveals the salary of the CEO in the original dataset.

Another example is that of a dataset containing patients with a particular pathology. Being able to infer that a specific individual was in the original dataset would constitute a privacy leak for that individual.

Patient ID | Age | ZIP code | Systolic blood pressure (mm Hg)
001        | 21  | 34016    | 116
002        | 45  | 38068    | 125
003        | 72  | 00154    | 110

The ZIP code 34016 belongs to Monrupino, a small but charming village near Trieste with fewer than 1000 inhabitants. If the ZIP code column is defined as categorical, the generative model will memorize the possible values the column can take, even rare ones like the Monrupino ZIP code. During the generation of synthetic data, a rare ZIP code will not be generated often; however, when it is generated, it reveals that somebody from Monrupino was in the original dataset. Even if this information does not explicitly disclose who that person is, if other publicly accessible information can be cross-referenced with the generated synthetic data, the identity of that person may ultimately be revealed. In any case, the mere presence of a rare category in the generated dataset can disclose more private information than intended.

The aindo.rdml library contains a series of tools to remove or mitigate these kinds of privacy leaks and add an extra layer of protection to the specific values present in a column. The problematic values can be detected and masked in the original dataset, so that the model never learns them. When generating synthetic data, the sensitive values may either be generated masked, or be replaced by other viable, non-sensitive values. All these behaviors can be tuned with the protection parameter of the ColumnPreproc object.

The protection parameter can be either the boolean flag True, indicating the default protection (ColumnPreproc(protection=True)), or a Protection object, with which the user can customize the protection measures.
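
For instance, a sketch enabling the default protection on the price column (table and column names assumed from the earlier examples on this page):

```python
from aindo.rdml.synth import ColumnPreproc, TabularPreproc

preproc = TabularPreproc.from_schema(
    schema=...,
    preprocessors={
        "listings": {
            "price": ColumnPreproc(protection=True),  # default protection for this column type
        },
    },
)
```
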

When configuring a Protection object, three optional arguments can be provided:

  • detectors, a sequence of Detector objects that perform a detection of values that should be protected, based on the column type and a chosen detection strategy. The full list of the available detectors is provided in the API reference.
  • default, a boolean flag indicating whether the default protection for that column type should be enabled.
  • type, a string or a ProtectionType object that describes the protection strategy. This can be either imputation ("impute", ProtectionType.IMPUTE) or masking ("mask", ProtectionType.MASK). Imputation replaces sensitive values with plausible alternatives within the column; masking replaces sensitive values with placeholders.

For instance, we could use a RareCategoryDetector, which determines the rare categories based on the number of occurrences, together with a masking strategy, on the neighbourhood column as follows:

from aindo.rdml.synth import ColumnPreproc, Protection, RareCategoryDetector, TabularPreproc

preproc = TabularPreproc.from_schema(
    schema=...,
    preprocessors={
        "listings": {
            "neighbourhood": ColumnPreproc(
                protection=Protection(
                    detectors=(RareCategoryDetector(),),
                    type="mask",
                ),
            ),
        },
    },
)

Custom column preprocessors (expert user)

To each Column type presented in this section, the library associates an internal default column preprocessor, which in turn defines how the column data is preprocessed before being fed to the generative model. The user may define a different preprocessor than the default one by means of the preprocessors parameter of the TabularPreproc.from_schema() method.

The available column preprocessors are: Categorical, Coordinates, Date, Datetime, Time, Integer, Numeric, ItaFiscalCode and Text. The table below illustrates the default mappings from column types to column preprocessors.

Column type           | Default column preprocessor
BOOLEAN / CATEGORICAL | Categorical
NUMERIC / INTEGER     | Numeric
DATE                  | Date
TIME                  | Time
DATETIME              | Datetime
COORDINATES           | Coordinates
ITAFISCALCODE         | ItaFiscalCode
TEXT                  | Text

Not all column preprocessors are compatible with all kinds of input data. For example, while the Categorical preprocessor can deal with virtually any type of column data, the Datetime preprocessor will raise an error if the input data cannot be interpreted as datetime. Other similar limitations apply to the other column preprocessors.

Column preprocessors may be configured using the arguments special_values, impute_nan, non_sample_values and protection, which are common to all columns, plus the specific arguments available to each one. All the parameters available to each column preprocessor are listed in the API reference.
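
As a sketch, the common arguments can be passed to a specific column preprocessor in the same way as to a ColumnPreproc object (the "listings" table, "price" column, and values are illustrative):

```python
from aindo.rdml.synth import Numeric, TabularPreproc

preproc = TabularPreproc.from_schema(
    schema=...,
    preprocessors={
        "listings": {
            "price": Numeric(
                special_values=["missing"],  # treated separately from the numeric values
                impute_nan=True,             # do not generate missing values
            ),
        },
    },
)
```
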

For instance, the user might want to preprocess the minimum_nights column with a Categorical preprocessor, instead of the default Numeric:

from aindo.rdml.synth import Categorical, TabularPreproc

preproc = TabularPreproc.from_schema(
    schema=...,
    preprocessors={
        "listings": {
            "minimum_nights": Categorical(),
        },
    },
)