Data preprocessing
Data preprocessing means transforming the data columns to make them suitable for model training. The process can also include optional steps that reduce the risk of privacy breaches and help guarantee data anonymization.
Preprocessing is performed with a `TabularPreproc` object. To instantiate the default preprocessor, users can pass a `Schema` object to the `TabularPreproc.from_schema()` method. After instantiation, the `TabularPreproc` object must be fitted on a `RelationalData` object.
```python
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularPreproc

data = RelationalData(data=..., schema=...)
preproc = TabularPreproc.from_schema(schema=data.schema)
preproc.fit(data=data)
```
Users can also specify custom preprocessing for each column. This is achieved by passing the `preprocessors` argument to the `TabularPreproc.from_schema()` method. The `preprocessors` parameter is a dictionary where the keys are table names, and the values are dictionaries mapping column names to one of the following:
- A `ColumnPreproc` object, which defines custom behavior for that column during the preprocessing step;
- A `None` value, which tells the preprocessor to ignore that column;
- A custom column preprocessor instance. This option is designed for advanced users seeking access to lower-level functionality.
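For instance, the first two options can be combined in a single `preprocessors` dictionary. The sketch below assumes the `listings` table also contains an `id` column we wish to exclude; adapt the table and column names to your own schema:

```python
from aindo.rdml.synth import ColumnPreproc, TabularPreproc

preproc = TabularPreproc.from_schema(
    schema=...,
    preprocessors={
        "listings": {
            # Custom behavior for the "price" column
            "price": ColumnPreproc(impute_nan=True),
            # Ignore the "id" column entirely
            "id": None,
        },
    },
)
```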
The preprocessing of text data is managed by `TextPreproc` objects, one for each table containing text. A `TextPreproc` object must also preprocess the tabular part of the data, which is used to condition the text during training and generation. In most cases, the text columns are generated in addition to the rest of the tabular data, and therefore a `TabularPreproc` object is already available. Each `TextPreproc` object can then be built from it with the `TextPreproc.from_tabular()` method, which also takes the name of the table to consider.
```python
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularPreproc, TextPreproc

data = RelationalData(data=..., schema=...)
preproc = TabularPreproc.from_schema(schema=data.schema)
preproc.fit(data=data)
preproc_text = TextPreproc.from_tabular(preproc=preproc, table="listings")
preproc_text.fit(data=data)
```
If no `TabularPreproc` object is available, the text preprocessor can also be built from scratch with the `TextPreproc.from_schema_table()` method, which requires the `Schema` and the name of the table containing the text columns.
```python
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TextPreproc

data = RelationalData(data=..., schema=...)
preproc_text = TextPreproc.from_schema_table(schema=data.schema, table="listings")
preproc_text.fit(data=data)
```
To ensure consistency, the first method (`TextPreproc.from_tabular()`) is recommended when both tabular and text data need to be generated. Note that custom preprocessing of text columns is not supported.
ColumnPreproc (advanced user)
A `ColumnPreproc` object offers four optional parameters to customize the preprocessing of a column:
- `special_values`: a set of special values that will be treated separately from the other values of the column, for example in a column with mixed-type values.
- `impute_nan`: forces the model to avoid generating missing values in the synthetic data.
- `non_sample_values`: a set of values that will not be generated in the synthetic data.
- `protection`: adds extra protection against potential privacy leaks coming from rare or extremal values present in the original column data.
In the next subsections, we describe the effect of each of these parameters in detail.
Special values
The `special_values` parameter takes a list of values that are considered special or unique within the dataset, such as special characters occurring in a numeric column or outliers within a distribution. For instance, in the Airbnb dataset, let us assume that the numerical column `price` can sometimes take the non-numerical value `"missing"`. In such a case, we might mark this value as special:
```python
from aindo.rdml.synth import ColumnPreproc, TabularPreproc

preproc = TabularPreproc.from_schema(
    schema=...,
    preprocessors={
        "listings": {
            "price": ColumnPreproc(special_values=["missing"]),
        },
    },
)
```
Imputation of missing values
The `impute_nan` parameter is a boolean flag that determines whether missing values in the column should be imputed. When set to `True`, NaN values are imputed, ensuring that the synthetic data does not include any NaN values in that column. For instance, to avoid sampling NaN values in the `price` column:
```python
from aindo.rdml.synth import ColumnPreproc, TabularPreproc

preproc = TabularPreproc.from_schema(
    schema=...,
    preprocessors={
        "listings": {
            "price": ColumnPreproc(impute_nan=True),
        },
    },
)
```
Avoid sampling certain values
The `non_sample_values` parameter allows the user to provide a list of values that will not be sampled during generation, e.g. `"Manhattan"` and `"Brooklyn"` in the `neighbourhood_group` column:
```python
from aindo.rdml.synth import ColumnPreproc, TabularPreproc

preproc = TabularPreproc.from_schema(
    schema=...,
    preprocessors={
        "listings": {
            "neighbourhood_group": ColumnPreproc(non_sample_values=["Manhattan", "Brooklyn"]),
        },
    },
)
```
In place of these values, other plausible values from the same column will be sampled when generating synthetic data.
Protection of rare values
The `aindo.rdml` library provides a range of options to add extra privacy protection for extremal or rare values that may be present in the columns. Even though the model cannot learn from individual data subjects, it does learn rare categories and the ranges of numerical values, which in some cases might disclose sensitive data from the original dataset. Consider, for example, a dataset with information about the employees of a company, including their salaries, where the CEO has the highest salary in the dataset.
| Employee ID | Name | Age | Role | Salary |
|---|---|---|---|---|
| 001 | Alice Johnson | 60 | CEO | $100,000 |
| 002 | John Smith | 32 | HR | $55,000 |
| 003 | Emily Davis | 35 | Finance | $65,000 |
A model trained on this dataset will learn the range of values that the `Salary` column can take. When generating synthetic data, the model may (rarely) generate employees with salaries as high as the CEO's. Such an extremal value in the synthetic dataset effectively reveals the salary of the CEO in the original dataset.
Another example is a dataset containing patients with a particular pathology. Being able to infer that a specific individual was in the original dataset would constitute a privacy leak for that individual.
| Patient ID | Age | ZIP code | Systolic blood pressure (mm Hg) |
|---|---|---|---|
| 001 | 21 | 34016 | 116 |
| 002 | 45 | 38068 | 125 |
| 003 | 72 | 00154 | 110 |
The ZIP code 34016 belongs to Monrupino, a small but charming village near Trieste with fewer than 1000 inhabitants. If the `ZIP code` column is defined as categorical, the generative model will memorize the possible values the column can take, even rare ones like the Monrupino ZIP code. During generation, a rare ZIP code won't be produced often; however, when it is produced, it reveals that somebody from Monrupino was in the original dataset. Even if this information does not explicitly disclose who that person is, if other publicly accessible information can be cross-referenced with the generated synthetic data, the identity of that person may ultimately be revealed. In any case, the mere presence of a rare category in the generated dataset can disclose more private information than intended.
The `aindo.rdml` library contains a series of tools to remove or mitigate these kinds of privacy leaks and add an extra layer of protection to the specific values present in a column. Problematic values can be detected and masked in the original dataset, so that the model never learns them. When generating synthetic data, the sensitive values may either be generated masked, or they may be replaced by other viable, non-sensitive values. All these behaviors can be tuned with the `protection` parameter of the `ColumnPreproc` object.
The `protection` parameter can be either the boolean flag `True`, indicating the default protection (`ColumnPreproc(protection=True)`), or a `Protection` object, with which the user can customize the protection measures.
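For instance, a minimal sketch enabling the default protection on a single column (the `host_name` column used here is hypothetical; substitute a column from your own schema):

```python
from aindo.rdml.synth import ColumnPreproc, TabularPreproc

preproc = TabularPreproc.from_schema(
    schema=...,
    preprocessors={
        "listings": {
            # True enables the default protection for this column type
            "host_name": ColumnPreproc(protection=True),
        },
    },
)
```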
When configuring a `Protection` object, three optional arguments can be provided:
- `detectors`: a sequence of `Detector` objects that detect the values to be protected, based on the column type and a chosen detection strategy. The full list of available detectors is provided in the API reference.
- `default`: a boolean flag indicating whether the default protection for that column type should be enabled.
- `type`: a string or a `ProtectionType` object that describes the protection strategy. This can be either imputation (`"impute"`, `ProtectionType.IMPUTE`) or masking (`"mask"`, `ProtectionType.MASK`). Imputation replaces sensitive values with plausible alternatives from the same column; masking replaces them with placeholders.
For instance, we could use a `RareCategoryDetector`, which detects rare categories based on their number of occurrences, together with a masking strategy on the `neighbourhood` column, as follows:
```python
from aindo.rdml.synth import ColumnPreproc, Protection, RareCategoryDetector, TabularPreproc

preproc = TabularPreproc.from_schema(
    schema=...,
    preprocessors={
        "listings": {
            "neighbourhood": ColumnPreproc(
                protection=Protection(
                    detectors=(RareCategoryDetector(),),
                    type="mask",
                ),
            ),
        },
    },
)
```
Custom column preprocessors (expert user)
To each `Column` type presented in this section, the library associates a default internal column preprocessor, which in turn defines how the column data is preprocessed before being fed to the generative model. The user may choose a different preprocessor than the default one by means of the `preprocessors` parameter of the `TabularPreproc.from_schema()` method.
The available column preprocessors are: `Categorical`, `Coordinates`, `Date`, `Datetime`, `Time`, `Integer`, `Numeric`, `ItaFiscalCode`, and `Text`.
The table below illustrates the default mapping from column types to column preprocessors:

| Column type | Default column preprocessor |
|---|---|
| BOOLEAN / CATEGORICAL | Categorical |
| NUMERIC / INTEGER | Numeric |
| DATE | Date |
| TIME | Time |
| DATETIME | Datetime |
| COORDINATES | Coordinates |
| ITAFISCALCODE | ItaFiscalCode |
| TEXT | Text |
Not all column preprocessors are compatible with all kinds of input data. For example, while the `Categorical` preprocessor can deal with virtually any type of column data, the `Datetime` preprocessor will raise an error if the input data cannot be interpreted as datetime. Similar limitations apply to the other column preprocessors.
Column preprocessors may be configured using the arguments `special_values`, `impute_nan`, `non_sample_values`, and `protection`, which are common to all columns, plus the specific arguments available to each one. All the parameters available to each column preprocessor are listed in the API reference.
For instance, the user might want to preprocess the `minimum_nights` column with a `Categorical` preprocessor, instead of the default `Numeric`:
```python
from aindo.rdml.synth import Categorical, TabularPreproc

preproc = TabularPreproc.from_schema(
    schema=...,
    preprocessors={
        "listings": {
            "minimum_nights": Categorical(),
        },
    },
)
```