
Evaluation

ReportColumns dataclass

The columns to be used in each section of the report. For each section, if the columns are specified by an integer n, the first n columns of each table will be used. If a dict is provided instead, it should map each table name to either a collection of the columns to be used, or to an integer n specifying the number of columns. For tables missing from the dict, all columns will be retained.

Parameters:

Name Type Description Default
univ dict[str, int | Collection[str]] | int

The columns to be used for the univariate distributions.

dict()
biv dict[str, int | Collection[str]] | int

The columns to be used for the bivariate distributions.

dict()
knn dict[str, int | Collection[str]] | int

The columns to be used for the k-NN (nearest neighbors) analysis.

dict()
phik dict[str, int | Collection[str]] | int

The columns to be used for the PhiK analysis.

dict()
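The selection rules above can be illustrated with a small, self-contained sketch. It is not the library's implementation: `select_columns` is a hypothetical helper, tables are modelled as plain lists of column names, and keeping explicitly named columns in table order is an assumption.

```python
from __future__ import annotations

from collections.abc import Collection


def select_columns(
    tables: dict[str, list[str]],
    spec: dict[str, int | Collection[str]] | int,
) -> dict[str, list[str]]:
    """Hypothetical helper: resolve a column spec following the documented rules."""
    selected: dict[str, list[str]] = {}
    for table, columns in tables.items():
        # An int applies to every table; a dict is looked up per table.
        rule = spec if isinstance(spec, int) else spec.get(table)
        if rule is None:
            # Table missing from the dict: all columns are retained.
            selected[table] = list(columns)
        elif isinstance(rule, int):
            # Integer n: keep the first n columns of the table.
            selected[table] = columns[:rule]
        else:
            # Explicit collection: keep the named columns (table order is an assumption).
            selected[table] = [c for c in columns if c in rule]
    return selected


# Toy schema, invented for illustration only.
tables = {
    "users": ["id", "age", "city", "income"],
    "orders": ["id", "user_id", "amount"],
}
print(select_columns(tables, 2))
print(select_columns(tables, {"users": ["age", "income"]}))
```

In the second call, "orders" is absent from the dict, so all of its columns are kept.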

PrivacyStats dataclass

Data structure containing the privacy statistics for a single table.

Attributes:

Name Type Description
privacy_score float

The privacy score.

privacy_score_std float

An estimate of the standard deviation of the privacy score.

risk float

An estimate of the fraction of training points at risk of re-identification.

report

report(
    data_train: RelationalData,
    data_test: RelationalData,
    data_synth: RelationalData,
    path: Path | str,
    n_max_train: int | None = 5000,
    n_max_test: int | None = 5000,
    columns: dict[str, int | Collection[str]] | int | ReportColumns = ReportColumns(
        univ=100, biv=20, knn=100, phik=20
    ),
) -> None

Collect summary statistics for the evaluation of synthetic data in terms of data quality and privacy protection.

Parameters:

Name Type Description Default
data_train RelationalData

A RelationalData object containing the original training data.

required
data_test RelationalData

A RelationalData object containing the original test data.

required
data_synth RelationalData

A RelationalData object containing the generated synthetic data.

required
path Path | str

A path to save the report.

required
n_max_train int | None

The maximum number of samples per table (for train data) to use in the report.

5000
n_max_test int | None

The maximum number of samples per table (for test data) to use in the report.

5000
columns dict[str, int | Collection[str]] | int | ReportColumns

The columns to use for the computation of the report sections. It can be an instance of ReportColumns, which is a data structure containing the columns to be used in each report section; these can be provided as an int or a dict. Otherwise, it can be an int or a dict, and in this case the same settings will be applied to all sections. For each section, if the columns to be used are specified by an integer n, the first n columns of each table will be used. Otherwise, if a dict is provided, it should map each table to either the columns to be used, or to an integer n specifying the number of columns. For tables missing from the dict, all columns will be retained. By default, 100 columns are used for the univariate distributions and the k-NN analysis, and 20 for the bivariate distributions and the PhiK analysis.

ReportColumns(univ=100, biv=20, knn=100, phik=20)
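As a rough illustration of how the `columns` argument is described to behave, the sketch below expands a bare int or dict into per-section settings. `ReportColumnsSketch` is a stand-in that only mirrors the documented fields of ReportColumns, and `normalize_columns` is a hypothetical helper; neither is part of the library.

```python
from __future__ import annotations

from collections.abc import Collection
from dataclasses import dataclass, field
from typing import Union

# Type of a single section's column spec, as documented.
ColumnSpec = Union[dict[str, Union[int, Collection[str]]], int]


@dataclass
class ReportColumnsSketch:
    """Stand-in mirroring the documented ReportColumns fields and defaults."""

    univ: ColumnSpec = field(default_factory=dict)
    biv: ColumnSpec = field(default_factory=dict)
    knn: ColumnSpec = field(default_factory=dict)
    phik: ColumnSpec = field(default_factory=dict)


def normalize_columns(columns: ColumnSpec | ReportColumnsSketch) -> ReportColumnsSketch:
    """Hypothetical helper: apply a bare int or dict to all four sections."""
    if isinstance(columns, ReportColumnsSketch):
        # Per-section settings were given explicitly; use them as they are.
        return columns
    # A single int or dict: the same settings apply to every section.
    return ReportColumnsSketch(univ=columns, biv=columns, knn=columns, phik=columns)


# The documented default: 100 columns for univ and knn, 20 for biv and phik.
default = ReportColumnsSketch(univ=100, biv=20, knn=100, phik=20)
print(normalize_columns(default))
print(normalize_columns(50))
```

Passing a ReportColumns instance is the way to give the cheaper sections (univariate, k-NN) a larger column budget than the pairwise ones (bivariate, PhiK), as the default does.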

compute_privacy_stats

compute_privacy_stats(
    data_train: RelationalData,
    data_synth: RelationalData,
    q: float = 0.1,
    risk_confidence: float = 0.0,
    n_folds_std: int | None = 10,
    n_max: int | None = None,
) -> dict[str, PrivacyStats | None]

Compute privacy statistics for the evaluation of synthetic data.

Parameters:

Name Type Description Default
data_train RelationalData

A RelationalData object containing the original training data.

required
data_synth RelationalData

A RelationalData object containing the generated synthetic data.

required
q float

The quantile used to compute the privacy score and the number of records at risk of re-identification.

0.1
risk_confidence float

A confidence parameter for the estimation of the number of records at risk of re-identification. The estimated number of records at risk (n_risk) is corrected by adding the term -risk_confidence * sqrt(n_risk).

0.0
n_folds_std int | None

The number of folds to use in the computation of the standard deviation; it must be larger than 1. If None, the computation is not performed.

10
n_max int | None

The maximum number of samples per table (for both train and synth data) to use in the computation.

None
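The correction described for `risk_confidence` can be worked through numerically. The sketch below assumes the term is simply added to the raw estimate, as the formula suggests; `corrected_risk_count` is a hypothetical name, and any clamping or normalization the library may apply afterwards is not shown.

```python
import math


def corrected_risk_count(n_risk: float, risk_confidence: float = 0.0) -> float:
    """Hypothetical helper: n_risk - risk_confidence * sqrt(n_risk)."""
    return n_risk - risk_confidence * math.sqrt(n_risk)


# With the default risk_confidence of 0.0 the estimate is left unchanged.
print(corrected_risk_count(100))       # 100.0
# A confidence of 2.0 subtracts 2 * sqrt(100) = 20 records.
print(corrected_risk_count(100, 2.0))  # 80.0
```

A larger `risk_confidence` therefore gives a more conservative (lower) count of records flagged as at risk.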

Returns:

Type Description
dict[str, PrivacyStats | None]

A dictionary mapping each table to a PrivacyStats object (or None in case of error).
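Because the returned dictionary may contain None for tables where the computation failed, callers will usually want to check each entry before reading its attributes. The following sketch uses a stand-in dataclass, `PrivacyStatsSketch`, that only mirrors the documented PrivacyStats attributes; `summarize` and the sample values are invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PrivacyStatsSketch:
    """Stand-in mirroring the documented PrivacyStats attributes."""

    privacy_score: float
    privacy_score_std: float
    risk: float


def summarize(stats: dict[str, PrivacyStatsSketch | None]) -> list[str]:
    """Hypothetical helper: one summary line per table, tolerating None entries."""
    lines = []
    for table, s in stats.items():
        if s is None:
            # The documentation states that None signals an error for this table.
            lines.append(f"{table}: privacy statistics unavailable")
        else:
            lines.append(
                f"{table}: score={s.privacy_score:.2f} "
                f"+/- {s.privacy_score_std:.2f}, risk={s.risk:.1%}"
            )
    return lines


# Made-up values standing in for the result of compute_privacy_stats.
example = {"users": PrivacyStatsSketch(0.5, 0.05, 0.012), "orders": None}
for line in summarize(example):
    print(line)
```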