MultiTable Synthetic Data from a Set of CSV Files
This guide demonstrates how to generate synthetic data from a collection of CSV files representing related tables in a database. Multi-table synthesis allows you to retain structural relationships—such as foreign key dependencies—between multiple tables, while anonymizing sensitive data and preserving statistical properties.
This approach is ideal for data scientists, developers, and analysts working with complex data structures who need privacy-safe datasets for testing, prototyping, or sharing.
This script demonstrates how to: - Load data from multiple CSV files - Read your schema and relationships from a yaml file - Create and configure your MultiTableSynthesizer - Generate synthetic data for each table while preserving relationships
Don't forget to set up your license key
Make sure to check your schema before training your synthetic data generator
Before starting, make sure your CSV files are cleaned and correctly reflect the schema relationships (e.g., primary and foreign keys).
Create your database relations as a yaml file
To correctly synthesize multi-table data, you must define the relationships between tables. This includes specifying primary keys and foreign keys, which help the synthesizer understand how records are connected across your dataset.
These relationships are defined in a YAML configuration file. Each table is represented as a key, with primary_keys
and optional foreign_keys
fields. Foreign keys should clearly indicate the column, parent table, and the corresponding column in the parent table.
Below is an example of a valid YAML schema describing relationships for a set of CSV files from a banking dataset:
district:
primary_keys: [A1]
client:
primary_keys: [client_id]
foreign_keys:
- {column: district_id, parent_table: district, parent_column: A1}
account:
primary_keys: [account_id]
foreign_keys:
- {column: district_id, parent_table: district, parent_column: A1}
disp:
primary_keys: [disp_id]
foreign_keys:
- {column: account_id, parent_table: account, parent_column: account_id}
- {column: client_id, parent_table: client, parent_column: client_id}
card:
primary_keys: [card_id]
foreign_keys:
- {column: disp_id, parent_table: disp, parent_column: disp_id}
loan:
primary_keys: [loan_id]
foreign_keys:
- {column: account_id, parent_table: account, parent_column: account_id}
order:
primary_keys: [order_id]
foreign_keys:
- {column: account_id, parent_table: account, parent_column: account_id}
trans:
primary_keys: [trans_id]
foreign_keys:
- {column: account_id, parent_table: account, parent_column: account_id}
Example Code
"""
File to test the multitable synthesizer.
This is an example for the Berka database using the MySQL connector
"""
import os
import pprint
from yaml import safe_load
import pandas as pd
from ydata.dataset import Dataset, MultiDataset
from ydata.metadata import MultiMetadata
from ydata.synthesizers import MultiTableSynthesizer
TRAIN = True
GENERATE = True
WRITE = False
out_path = "./multitable_model.pkl"
if __name__ == "__main__":
# To print the dictionary structure (Optional)
pp = pprint.PrettyPrinter(indent=4)
"""
# Step 1: Read all your CSV files using Pandas. Create a dictionary with the following structure
{'table_name': table_daraframe}
""""
datasets = {}
for f in os.listdir('data'):
if f.endswith(".csv"):
path = os.path.join('data', f)
df = pd.read_csv(path, sep=';')
dataset = Dataset(df)
datasets[f.replace('.csv', '')] = dataset
pp.pprint(datasets)
"""
Step 2: Read your database schema from an existing yaml file
The yaml file is expected to have the following structure:
table1_name:
primary_keys: [table1_PK]
table2_name:
primary_keys: [table2_PK]
foreign_keys:
- {column: table1_PK, parent_table: table1_name, parent_column: table1_PK}
"""
with open(f"data/schema.yml", "r") as f:
schema = safe_load(f)
pp.pprint(schema)
"""
Step 3: Create yout MultiDataset object
"""
database = MultiDataset(datasets=datasets,
schema=schema)
print(database)
#Validate if the data from your database is correct
database['order'].head()
"""
Step 4: Create the MultiMetadata object
"""
m = MultiMetadata(multiset=database)
print(m)
"""
Step 5: Train your MultiTableSynthesizer
"""
if TRAIN:
synth = MultiTableSynthesizer()
synth.fit(X=database,
metadata=m)
synth.save(out_path)
else:
synth = MultiTableSynthesizer.load(out_path)
if GENERATE and WRITE:
sample = 1
sample = synth.sample(n_samples=sample)
else:
sample = 1
sample = synth.sample(n_samples=sample)
print(f"Generate synthetic database: {sample}")