Evaluation
ReportColumns
dataclass
The columns to be used in the different sections of the report.
For each section, if the columns to be used are specified by an integer `n`, the first `n` columns of each table will be used.
Otherwise, if a dict is provided, it should map each table to either the columns to be used, or to an integer `n` specifying the number of columns. For missing tables all columns will be retained.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`univ` | `dict[str, int \| Collection[str]] \| int` | The columns to be used for the univariate distributions. | `dict()` |
`biv` | `dict[str, int \| Collection[str]] \| int` | The columns to be used for the bivariate distributions. | `dict()` |
`knn` | `dict[str, int \| Collection[str]] \| int` | The columns to be used for the k-NN (nearest neighbors) analysis. | `dict()` |
`phik` | `dict[str, int \| Collection[str]] \| int` | The columns to be used for the PhiK analysis. | `dict()` |
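The selection rule described above can be illustrated with a small sketch. The helper `select_columns` and the example `schema` are hypothetical and are not part of the library; they only mirror the documented behavior (an integer keeps the first `n` columns of each table, a dict is applied per table, and tables missing from the dict keep all their columns). Keeping an explicit collection in the table's own column order is an assumption of this sketch.

```python
from __future__ import annotations

from collections.abc import Collection


def select_columns(
    table_columns: dict[str, list[str]],
    spec: dict[str, int | Collection[str]] | int,
) -> dict[str, list[str]]:
    """Hypothetical helper mirroring the documented column-selection rule."""
    selected: dict[str, list[str]] = {}
    for table, cols in table_columns.items():
        # An integer spec applies to every table; a dict is looked up per table.
        rule = spec if isinstance(spec, int) else spec.get(table)
        if rule is None:
            # Table missing from the dict: all columns are retained.
            selected[table] = list(cols)
        elif isinstance(rule, int):
            # Integer n: keep the first n columns of the table.
            selected[table] = list(cols[:rule])
        else:
            # Explicit collection: keep the listed columns (table order is an assumption).
            wanted = set(rule)
            selected[table] = [c for c in cols if c in wanted]
    return selected


# Made-up schema, for illustration only.
schema = {
    "users": ["id", "age", "city", "income"],
    "orders": ["id", "user_id", "amount"],
}

print(select_columns(schema, 2))
print(select_columns(schema, {"users": ["age", "income"]}))
```

In the second call, `orders` does not appear in the dict, so all of its columns are kept.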
PrivacyStats
dataclass
Data structure containing the privacy statistics for a single table.
Attributes:
Name | Type | Description |
---|---|---|
`privacy_score` | `float` | The privacy score. |
`privacy_score_std` | `float` | An estimate of the standard deviation of the privacy score. |
`risk` | `float` | An estimate of the fraction of training points at risk of re-identification. |
report
report(
    data_train: RelationalData,
    data_test: RelationalData,
    data_synth: RelationalData,
    path: Path | str,
    n_max_train: int | None = 5000,
    n_max_test: int | None = 5000,
    columns: dict[str, int | Collection[str]] | int | ReportColumns = ReportColumns(
        univ=100, biv=20, knn=100, phik=20
    ),
) -> None
Collect summary statistics for the evaluation of synthetic data in terms of data quality and privacy protection.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`data_train` | `RelationalData` | A | required |
`data_test` | `RelationalData` | A | required |
`data_synth` | `RelationalData` | A | required |
`path` | `Path \| str` | A path to save the report. | required |
`n_max_train` | `int \| None` | The maximum number of samples per table (for train data) to use in the report. | `5000` |
`n_max_test` | `int \| None` | The maximum number of samples per table (for test data) to use in the report. | `5000` |
`columns` | `dict[str, int \| Collection[str]] \| int \| ReportColumns` | The columns to use for the computation of the report sections. It can be an instance of | `ReportColumns(univ=100, biv=20, knn=100, phik=20)` |
compute_privacy_stats
compute_privacy_stats(
    data_train: RelationalData,
    data_synth: RelationalData,
    q: float = 0.1,
    risk_confidence: float = 0.0,
    n_folds_std: int | None = 10,
    n_max: int | None = None,
) -> dict[str, PrivacyStats | None]
Compute privacy statistics for the evaluation of synthetic data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`data_train` | `RelationalData` | A | required |
`data_synth` | `RelationalData` | A | required |
`q` | `float` | The quantile used to compute the privacy score and the number of records at risk of re-identification. | `0.1` |
`risk_confidence` | `float` | A confidence parameter for the estimation of the number of records at risk of re-identification. The estimated number of records at risk ( | `0.0` |
`n_folds_std` | `int \| None` | Number of folds to use in the computation of the standard deviation, must be larger than 1. If | `10` |
`n_max` | `int \| None` | The maximum number of samples per table (for both train and synth data) to use in the computation. | `None` |
Returns:
Type | Description |
---|---|
`dict[str, PrivacyStats \| None]` | A dictionary mapping each table to a |
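Since the return type allows `None` for a table, code that consumes the result may want to handle that case explicitly. The following is a minimal sketch under stated assumptions: `PrivacyStats` here is a local stand-in that only copies the three documented attributes so the example runs on its own, `summarize` is a hypothetical helper, and the numbers are made up rather than real results.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PrivacyStats:
    """Local stand-in with the documented attributes (not the library class)."""

    privacy_score: float
    privacy_score_std: float
    risk: float


def summarize(stats: dict[str, PrivacyStats | None]) -> list[str]:
    """Hypothetical helper: one human-readable line per table."""
    lines = []
    for table, s in stats.items():
        if s is None:
            # The signature allows None; this sketch simply reports it.
            lines.append(f"{table}: no privacy statistics")
            continue
        lines.append(
            f"{table}: score={s.privacy_score:.3f} "
            f"+/- {s.privacy_score_std:.3f}, risk={s.risk:.1%}"
        )
    return lines


# Made-up values, shaped like the documented return type.
example = {"users": PrivacyStats(0.48, 0.02, 0.031), "orders": None}
for line in summarize(example):
    print(line)
```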