Evaluation

ReportColumns pydantic-config

ReportColumns(
    univ: dict[str, int | set[str]] | int | None = None,
    biv: dict[str, int | set[str]] | int | None = None,
    knn: dict[str, int | set[str]] | int | None = None,
    phik: dict[str, int | set[str]] | int | None = None,
)

The columns to use in each section of the report. If the columns for a section are specified by an integer n, the first n columns of each table are used. If a dict is provided instead, it should map each table to either the columns to use or an integer n specifying the number of columns. Tables missing from the dict retain all their columns.

Parameters:

Name Type Description Default
univ dict[str, int | set[str]] | int | None

The columns to be used for the univariate distributions.

None
biv dict[str, int | set[str]] | int | None

The columns to be used for the bivariate distributions.

None
knn dict[str, int | set[str]] | int | None

The columns to be used for the k-NN (nearest neighbors) analysis.

None
phik dict[str, int | set[str]] | int | None

The columns to be used for the PhiK analysis.

None
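
The selection rules above can be sketched in plain Python. This is only an illustration of the documented semantics, not the library's implementation: `select_columns` and the `schema` dict are hypothetical names, and treating a `None` spec as "keep every column" is an assumption.

```python
from __future__ import annotations


def select_columns(
    table_columns: dict[str, list[str]],
    spec: dict[str, int | set[str]] | int | None,
) -> dict[str, list[str]]:
    """Hypothetical helper: apply one section's column spec to every table."""
    selected: dict[str, list[str]] = {}
    for table, columns in table_columns.items():
        if spec is None:
            # Assumption: no spec means no restriction.
            selected[table] = list(columns)
            continue
        # An int applies to every table; a dict is looked up per table.
        table_spec = spec if isinstance(spec, int) else spec.get(table)
        if table_spec is None:
            # Table missing from the dict: all columns are retained.
            selected[table] = list(columns)
        elif isinstance(table_spec, int):
            # Integer n: keep the first n columns.
            selected[table] = list(columns[:table_spec])
        else:
            # Explicit set of column names: keep them in table order.
            selected[table] = [c for c in columns if c in table_spec]
    return selected


# Made-up schema, for illustration only.
schema = {
    "users": ["id", "age", "city", "income"],
    "orders": ["id", "user_id", "amount"],
}
first_two = select_columns(schema, 2)
mixed = select_columns(schema, {"users": {"age", "income"}})
```

Here `first_two` keeps the first two columns of both tables, while `mixed` restricts only `users` and leaves `orders` untouched because it is missing from the dict.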

PrivacyStats dataclass

Data structure containing the privacy statistics for a single table.

Attributes:

Name Type Description
privacy_score float

The privacy score.

privacy_score_std float

An estimate of the standard deviation of the privacy score.

risk float

An estimate of the fraction of training points at risk of re-identification.

report

report(
    data_train: RelationalData,
    data_test: RelationalData,
    data_synth: RelationalData,
    path: Path | str,
    n_max_train: int | None = 5000,
    n_max_test: int | None = 5000,
    columns: ReportColumns
    | dict[str, int | Collection[str]]
    | int
    | None = ReportColumns(
        univ=100, biv=20, knn=100, phik=20
    ),
) -> None

Collect summary statistics for the evaluation of synthetic data in terms of data quality and privacy protection.

Parameters:

Name Type Description Default
data_train RelationalData

A RelationalData object containing the original training data.

required
data_test RelationalData

A RelationalData object containing the original test data.

required
data_synth RelationalData

A RelationalData object containing the generated synthetic data.

required
path Path | str

A path to save the report.

required
n_max_train int | None

The maximum number of samples per table (for train data) to use in the report.

5000
n_max_test int | None

The maximum number of samples per table (for test data) to use in the report.

5000
columns ReportColumns | dict[str, int | Collection[str]] | int | None

The columns to use for the computation of the report sections. It can be an instance of ReportColumns, a data structure containing the columns to be used in each report section, each provided as an int or a dict. Otherwise, it can be a single int or dict, in which case the same settings are applied to all sections. For each section, if the columns to be used are specified by an integer n, the first n columns of each table will be used. If a dict is provided instead, it should map each table to either the columns to be used or an integer n specifying the number of columns. Tables missing from the dict retain all their columns. By default, 100 columns are used for the univariate distributions and the k-NN analysis, and 20 for the bivariate distributions and the PhiK analysis.

ReportColumns(univ=100, biv=20, knn=100, phik=20)
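
The two ways of passing `columns` (per-section settings, or one setting shared by all sections) can be illustrated with a small sketch. `SectionColumns` is a hypothetical stand-in for ReportColumns and `per_section` is a hypothetical helper; neither is part of the documented API, and the library may resolve the argument differently internally.

```python
from __future__ import annotations

from dataclasses import dataclass

SECTIONS = ("univ", "biv", "knn", "phik")


@dataclass
class SectionColumns:
    """Hypothetical stand-in for ReportColumns, for illustration only."""

    univ: dict | int | None = None
    biv: dict | int | None = None
    knn: dict | int | None = None
    phik: dict | int | None = None


def per_section(columns: SectionColumns | dict | int | None) -> dict:
    """Return the column setting that each report section would receive."""
    if isinstance(columns, SectionColumns):
        # Per-section settings are taken as they are.
        return {name: getattr(columns, name) for name in SECTIONS}
    # A single int or dict is shared by all sections.
    return {name: columns for name in SECTIONS}


defaults = per_section(SectionColumns(univ=100, biv=20, knn=100, phik=20))
shared = per_section(10)
```

With the documented defaults, the univariate and k-NN sections get 100 columns and the bivariate and PhiK sections get 20, whereas passing `10` gives every section the same limit.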

compute_privacy_stats

compute_privacy_stats(
    data_train: RelationalData,
    data_synth: RelationalData,
    q: float = 0.1,
    risk_confidence: float = 0.0,
    n_folds_std: int | None = 10,
    n_max: int | None = None,
) -> dict[str, PrivacyStats | None]

Compute privacy statistics for the evaluation of synthetic data.

Parameters:

Name Type Description Default
data_train RelationalData

A RelationalData object containing the original training data.

required
data_synth RelationalData

A RelationalData object containing the generated synthetic data.

required
q float

The quantile used to compute the privacy score and the number of records at risk of re-identification.

0.1
risk_confidence float

A confidence parameter for the estimation of the number of records at risk of re-identification. The estimated number of records at risk (n_risk) is corrected with a factor of -risk_confidence * sqrt(n_risk).

0.0
n_folds_std int | None

The number of folds to use in the computation of the standard deviation. It must be larger than 1. If None, the computation is not performed.

10
n_max int | None

The maximum number of samples per table (for both train and synth data) to use in the computation.

None
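
As a worked example of the `risk_confidence` correction, the sketch below reads the description as an additive term, so the corrected count is n_risk - risk_confidence * sqrt(n_risk). That reading is an assumption, `corrected_risk_count` is a hypothetical name, and the library may apply further steps (such as clipping at zero) that are not shown here.

```python
import math


def corrected_risk_count(n_risk: float, risk_confidence: float = 0.0) -> float:
    """Hypothetical helper: apply the -risk_confidence * sqrt(n_risk) correction."""
    # Assumption: the correction is added to the raw estimate.
    return n_risk - risk_confidence * math.sqrt(n_risk)


uncorrected = corrected_risk_count(100)       # default risk_confidence = 0.0
conservative = corrected_risk_count(100, 2.0)  # 100 - 2 * sqrt(100)
```

With the default `risk_confidence` of 0.0 the estimate is left unchanged. Larger values lower the count by a multiple of sqrt(n_risk).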

Returns:

Type Description
dict[str, PrivacyStats | None]

A dictionary mapping each table to a PrivacyStats object (or None in case of error).
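
Because a table can map to None, callers need to handle that case when reading the result. The sketch below shows one way to do so. `PrivacyStatsStub` is a hypothetical stand-in that mirrors the documented attributes of PrivacyStats, `summarize` is a hypothetical helper, and the numbers are made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PrivacyStatsStub:
    """Hypothetical stand-in mirroring the documented PrivacyStats attributes."""

    privacy_score: float
    privacy_score_std: float
    risk: float


def summarize(stats: dict[str, PrivacyStatsStub | None]) -> list[str]:
    """Build one summary line per table, handling tables with no result."""
    lines = []
    for table, table_stats in stats.items():
        if table_stats is None:
            # None signals that the computation failed for this table.
            lines.append(f"{table}: not available")
        else:
            lines.append(
                f"{table}: score={table_stats.privacy_score:.3f} "
                f"+/- {table_stats.privacy_score_std:.3f}, "
                f"risk={table_stats.risk:.1%}"
            )
    return lines


# Made-up values, shaped like the documented return type.
example = {"users": PrivacyStatsStub(0.48, 0.02, 0.013), "orders": None}
summary = summarize(example)
```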