Synthetic Data vs Original Data

The synthetic data generation process can produce a Profiling Comparison report: a side-by-side analysis of the original and synthetic datasets that lets users compare their properties both visually and statistically. The report helps users quickly identify similarities and differences between the two datasets and judge whether the synthetic data meets the desired quality standards.

Key Features of the Profiling Comparison

Statistical Summary Comparison
The report includes a comprehensive statistical summary for both datasets, allowing users to compare key metrics such as:
- Mean, Median, and Mode for numerical features.
- Standard Deviation and Variance to assess data spread.
- Minimum and Maximum Values to understand data ranges.
- Missing Values to assess data completeness.
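
The metrics listed above can be reproduced by hand when you want to sanity-check a report. The following is a minimal sketch using only Python's standard library; it is not the package's own API. The helper names `summarize` and `compare_summaries` and the toy values are assumptions made for illustration, and missing values are assumed to be represented as `None`.

```python
import statistics


def summarize(values):
    """Summary statistics for one numerical column (None marks a missing value)."""
    present = [v for v in values if v is not None]
    return {
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "mode": statistics.mode(present),
        "std": statistics.stdev(present),         # sample standard deviation
        "variance": statistics.variance(present),  # sample variance
        "min": min(present),
        "max": max(present),
        "missing": len(values) - len(present),
    }


def compare_summaries(original, synthetic):
    """Map each statistic to (original, synthetic, absolute difference)."""
    orig_stats = summarize(original)
    synth_stats = summarize(synthetic)
    return {
        name: (orig_stats[name], synth_stats[name],
               abs(orig_stats[name] - synth_stats[name]))
        for name in orig_stats
    }


# Illustrative toy values, not real data.
original_age = [23, 35, 35, 41, 52, None, 29]
synthetic_age = [25, 34, 35, 35, 40, 50, 31]

for name, (orig, synth, diff) in compare_summaries(original_age, synthetic_age).items():
    print(f"{name:>8}: original={orig:.2f} synthetic={synth:.2f} diff={diff:.2f}")
```

Large differences in any row point to a feature whose spread, range, or completeness the synthetic data does not reproduce well.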

Distribution Visualization
The profiling comparison provides visualizations of the distributions for each feature in both datasets. This includes:
- Histograms and Kernel Density Estimates (KDE) for numerical features.
- Bar Charts for categorical features.
- Cumulative Distribution Functions (CDFs) to compare overall distributions.

These visualizations make it easy to spot discrepancies or confirm that the synthetic data closely matches the original data's distribution.
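
A CDF comparison can also be reduced to a single number: the largest vertical gap between the two empirical CDFs, which is the two-sample Kolmogorov-Smirnov statistic. The sketch below is an illustration written with the standard library only; the function names `ecdf` and `max_cdf_gap` and the toy values are hypothetical and are not taken from the package.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function x -> fraction of sample values less than or equal to x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def max_cdf_gap(original, synthetic):
    """Largest gap between the two empirical CDFs (two-sample KS statistic).

    Both CDFs are right-continuous step functions, so the maximum gap
    is always reached at one of the pooled sample points.
    """
    f_orig, f_synth = ecdf(original), ecdf(synthetic)
    points = sorted(set(original) | set(synthetic))
    return max(abs(f_orig(x) - f_synth(x)) for x in points)


# Illustrative toy values, not real data.
original_income = [30, 45, 60, 80, 120]
synthetic_income = [32, 44, 63, 78, 115]

print(f"max CDF gap: {max_cdf_gap(original_income, synthetic_income):.3f}")
```

A gap near 0 means the two distributions are nearly indistinguishable on this feature; a gap near 1 means they barely overlap.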

Correlation Analysis
The report includes a comparison of correlation matrices for both datasets, highlighting:
- Pearson Correlation for linear relationships between numerical features.
- Spearman Correlation for monotonic relationships.
- Heatmaps to visualize correlation strengths and patterns.

This helps verify that the synthetic data preserves the relationships between variables, which is critical for downstream tasks like machine learning.

Feature-Level Insights
For each feature, the profiling comparison provides:
- Descriptive Statistics: A side-by-side comparison of key statistics.
- Uniqueness Analysis: A comparison of unique values and their frequencies.
- Outlier Detection: Identification of outliers in both datasets.
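
The document does not state which outlier rule the report uses. As one possible illustration, the sketch below applies Tukey's 1.5 × IQR rule and counts unique values with the standard library. The rule, the function names `iqr_outliers` and `feature_insights`, and the toy values are assumptions for this example only.

```python
from collections import Counter
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule; an assumed choice)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def feature_insights(values):
    """Uniqueness and outlier summary for one numerical feature."""
    counts = Counter(values)
    return {
        "unique": len(counts),                  # number of distinct values
        "most_common": counts.most_common(3),   # top values with frequencies
        "outliers": iqr_outliers(values),
    }


# Illustrative toy values, not real data.
original_spend = [10, 12, 11, 13, 12, 11, 95]
synthetic_spend = [10, 11, 12, 12, 13, 11, 14]

print("original: ", feature_insights(original_spend))
print("synthetic:", feature_insights(synthetic_spend))
```

In this toy case the original feature contains an extreme value that the synthetic feature lacks, which is the kind of difference the feature-level view is meant to surface.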

Interactive Exploration
The profiling comparison report is designed to be interactive, allowing users to:
- Drill down into specific features for deeper analysis.
- Toggle between visualizations (e.g., histograms, box plots).
- Export visualizations or statistics for further use.

How to Use the Profiling Comparison

After generating a synthetic dataset, the profiling comparison report is automatically generated and can be accessed through the package's interface. Users can:

1. View the Report: Navigate to the profiling comparison section to explore the visual and statistical comparisons.
2. Download the Report: Export the report as JSON or HTML for sharing or further analysis.
3. Adjust Parameters: If discrepancies are found, fine-tune the synthetic data generation process and regenerate the report.
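
The package's own export mechanism is not shown here, so the following is only a rough sketch of what a JSON and an HTML rendering of comparison statistics might look like if you assembled them yourself. The report structure, the function names `to_json` and `to_html`, and the numbers are all hypothetical.

```python
import html
import json


def to_json(report):
    """Serialize the comparison statistics as indented JSON."""
    return json.dumps(report, indent=2)


def to_html(report):
    """Render the comparison statistics as a plain HTML table."""
    rows = "\n".join(
        f"<tr><td>{html.escape(feature)}</td><td>{html.escape(metric)}</td>"
        f"<td>{vals['original']}</td><td>{vals['synthetic']}</td></tr>"
        for feature, metrics in report.items()
        for metric, vals in metrics.items()
    )
    header = ("<tr><th>Feature</th><th>Metric</th>"
              "<th>Original</th><th>Synthetic</th></tr>")
    return f"<table>\n{header}\n{rows}\n</table>"


# Hypothetical report structure with illustrative numbers, not real results.
report = {
    "age": {
        "mean": {"original": 35.8, "synthetic": 35.7},
        "missing": {"original": 1, "synthetic": 0},
    }
}

print(to_json(report))
print(to_html(report))
```

JSON suits further programmatic analysis, while HTML suits sharing with readers who only need to view the comparison.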

Example Use Case

For example, if you generate synthetic data for a customer dataset, the profiling comparison will allow you to:

- Verify that the synthetic data maintains the same age distribution as the original dataset.
- Ensure that the correlation between income and spending habits is preserved.
- Confirm that categorical features like "gender" or "region" have similar proportions in both datasets.
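
The categorical check in this use case can be expressed as a comparison of category proportions. The sketch below summarizes the difference as the total variation distance (0 means identical proportions, 1 means no shared categories). This metric is a swapped-in illustration, not necessarily what the report computes, and the function names and toy "region" values are assumptions.

```python
from collections import Counter


def proportions(values):
    """Share of each category in a categorical column."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(original, synthetic):
    """Half the sum of absolute differences between category proportions."""
    p, q = proportions(original), proportions(synthetic)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Illustrative toy values for a "region" column, not real data.
original_region = ["north"] * 4 + ["south"] * 4 + ["east"] * 2
synthetic_region = ["north"] * 5 + ["south"] * 3 + ["east"] * 2

print("original: ", proportions(original_region))
print("synthetic:", proportions(synthetic_region))
print(f"distance: {total_variation_distance(original_region, synthetic_region):.3f}")
```

What counts as "similar enough" depends on the downstream task, so any threshold on this distance is a choice for the user rather than something fixed by the report.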