Synthetic Data Quality Report

After training a synthetic data generator, a Report PDF is automatically generated to provide a comprehensive evaluation of the synthetic data's quality. This report is designed to help users assess the performance of the synthetic data across three key dimensions: Privacy, Fidelity, and Utility. Each dimension is calculated using a variety of metrics, ensuring a robust and detailed analysis.

Key Scores in the Report

  1. Privacy
     The Privacy score evaluates how well the synthetic data protects sensitive information from the original dataset. It ensures that the synthetic data does not inadvertently reveal private or confidential details. Metrics used to calculate this score may include:

     - Distance to Closest Record (DCR): Measures how close synthetic data points are to real data points.
     - Membership Inference Attack (MIA) Risk: Assesses the likelihood of identifying whether a specific record was part of the training data.
     - Uniqueness: Evaluates the uniqueness of synthetic records compared to the original dataset.
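To make the DCR idea concrete, here is a minimal sketch (not YData's implementation) that computes, for each synthetic row, the Euclidean distance to its nearest real row; very small distances flag records that may be near-copies of the training data:

```python
import numpy as np

def distance_to_closest_record(synthetic: np.ndarray, real: np.ndarray) -> np.ndarray:
    """For each synthetic row, return the Euclidean distance to its
    nearest real row. Distances near zero suggest near-duplicates of
    training records and therefore higher privacy risk."""
    # Pairwise differences, shape (n_synthetic, n_real, n_features)
    diffs = synthetic[:, None, :] - real[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    return dists.min(axis=1)

rng = np.random.default_rng(0)
real = rng.normal(size=(100, 3))       # stands in for the original data
synthetic = rng.normal(size=(50, 3))   # stands in for generator output
dcr = distance_to_closest_record(synthetic, real)
print(dcr.shape)  # one nearest-neighbour distance per synthetic row
```

In practice columns would be scaled first so that no single feature dominates the distance, and the DCR distribution (not just the minimum) is compared against a holdout baseline.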

  2. Fidelity
     The Fidelity score assesses how closely the synthetic data resembles the original dataset in terms of statistical properties and distributions. High fidelity ensures that the synthetic data is a realistic representation of the original data. Metrics used to calculate this score may include:

     - Statistical Similarity: Compares summary statistics (e.g., mean, variance) between synthetic and real data.
     - Distribution Similarity: Measures the similarity of data distributions (e.g., using the Kolmogorov-Smirnov test or Wasserstein distance).
     - Correlation Preservation: Evaluates how well relationships between variables are maintained in the synthetic data.
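As an illustration of distribution similarity, the sketch below (an assumption about the approach, not the report's exact code) computes the two-sample Kolmogorov-Smirnov statistic for one column, i.e. the maximum gap between the empirical CDFs of the real and synthetic samples:

```python
import numpy as np

def ks_statistic(a: np.ndarray, b: np.ndarray) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute
    gap between the empirical CDFs of the two samples. 0 means the
    distributions match exactly; values near 1 mean they barely overlap."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.abs(cdf_a - cdf_b).max())

rng = np.random.default_rng(1)
real = rng.normal(0.0, 1.0, size=1000)
good_synth = rng.normal(0.0, 1.0, size=1000)  # same distribution as real
bad_synth = rng.normal(3.0, 1.0, size=1000)   # shifted distribution
print(ks_statistic(real, good_synth) < ks_statistic(real, bad_synth))  # True
```

A fidelity score would typically aggregate such per-column statistics (often as 1 minus the KS statistic) across all numeric columns.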

  3. Utility
     The Utility score measures the practical usefulness of the synthetic data for downstream tasks, such as machine learning model training or analysis. High utility ensures that the synthetic data performs well in real-world applications. Metrics used to calculate this score may include:

     - Machine Learning Performance: Compares the performance of models trained on synthetic data versus real data (e.g., accuracy, F1-score).
     - Feature Importance Consistency: Assesses whether the synthetic data preserves the importance of features for predictive tasks.
     - Downstream Task Performance: Evaluates how well the synthetic data performs in specific use cases (e.g., classification, regression).
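A common way to measure machine learning performance is the Train-on-Synthetic, Test-on-Real (TSTR) comparison. The sketch below, using scikit-learn on toy data (assumed, not the report's exact pipeline), trains one model on real data and one on synthetic data and scores both on the same held-out real test set; utility is high when the two scores are close:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

def make_data(n: int):
    # Toy binary-classification data with a simple linear boundary.
    X = rng.normal(size=(n, 4))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y

X_real, y_real = make_data(2000)    # stands in for the original data
X_synth, y_synth = make_data(2000)  # stands in for generator output

# Hold out part of the REAL data as the common test set.
X_train, X_test, y_train, y_test = train_test_split(
    X_real, y_real, test_size=0.5, random_state=0)

acc_real = accuracy_score(
    y_test, LogisticRegression().fit(X_train, y_train).predict(X_test))
acc_synth = accuracy_score(
    y_test, LogisticRegression().fit(X_synth, y_synth).predict(X_test))

# A small gap between the two accuracies indicates high utility.
print(round(acc_real, 2), round(acc_synth, 2))
```

The same pattern extends to F1-score or regression metrics, and comparing the fitted models' coefficients or feature importances gives the feature-importance-consistency check mentioned above.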

How the Report is Generated

The Report PDF leverages a combination of these metrics to calculate the overall scores for Privacy, Fidelity, and Utility. The metrics are carefully selected to provide a holistic view of synthetic data quality, ensuring that users can confidently use the synthetic data for their intended purposes.

For a deeper understanding of the metrics and methodologies used, refer to YData's Synthetic Data Quality Metrics documentation.