MultiTable Synthetic Data from a Set of CSV Files

This guide demonstrates how to generate synthetic data from a collection of CSV files representing related tables in a database. Multi-table synthesis allows you to retain structural relationships—such as foreign key dependencies—between multiple tables, while anonymizing sensitive data and preserving statistical properties.

This approach is ideal for data scientists, developers, and analysts working with complex data structures who need privacy-safe datasets for testing, prototyping, or sharing.

This script demonstrates how to:

- Load data from multiple CSV files
- Read your schema and relationships from a YAML file
- Create and configure your MultiTableSynthesizer
- Generate synthetic data for each table while preserving relationships

Don't forget to set up your license key

    import os

    os.environ['YDATA_LICENSE_KEY'] = '{add-your-key}'

Make sure to check your schema before training your synthetic data generator

Before starting, make sure your CSV files are cleaned and correctly reflect the schema relationships (e.g., primary and foreign keys).
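For instance, a quick pandas check like the hypothetical sketch below can confirm that foreign key values in a child table actually exist in the parent table (the file names, separator, and key columns follow the Berka banking example used throughout this guide):

    import pandas as pd

    # Hypothetical sanity check for the Berka example: every district_id in
    # client.csv should match a unique primary key value A1 in district.csv.
    district = pd.read_csv("data/district.csv", sep=";")
    client = pd.read_csv("data/client.csv", sep=";")

    assert district["A1"].is_unique, "primary key A1 must be unique"
    missing = ~client["district_id"].isin(district["A1"])
    print(f"{int(missing.sum())} client rows reference a missing district")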

Create your database relations as a YAML file

To correctly synthesize multi-table data, you must define the relationships between tables. This includes specifying primary keys and foreign keys, which help the synthesizer understand how records are connected across your dataset.

These relationships are defined in a YAML configuration file. Each table is represented as a key, with primary_keys and optional foreign_keys fields. Foreign keys should clearly indicate the column, parent table, and the corresponding column in the parent table.

Below is an example of a valid YAML schema describing relationships for a set of CSV files from a banking dataset:

district: 
  primary_keys: [A1]

client:
  primary_keys: [client_id]
  foreign_keys:
    - {column: district_id, parent_table: district, parent_column: A1}

account:
  primary_keys: [account_id]
  foreign_keys:
    - {column: district_id, parent_table: district, parent_column: A1}

disp:
  primary_keys: [disp_id]
  foreign_keys:
    - {column: account_id, parent_table: account, parent_column: account_id}
    - {column: client_id, parent_table: client, parent_column: client_id}

card:
  primary_keys: [card_id]
  foreign_keys:
    - {column: disp_id, parent_table: disp, parent_column: disp_id}

loan:
  primary_keys: [loan_id]
  foreign_keys:
    - {column: account_id, parent_table: account, parent_column: account_id}

order:
  primary_keys: [order_id]
  foreign_keys:
    - {column: account_id, parent_table: account, parent_column: account_id}

trans:
  primary_keys: [trans_id]
  foreign_keys:
    - {column: account_id, parent_table: account, parent_column: account_id}

💡 Tip: Save this file as schema.yaml (or any name you prefer), and use it later when initializing the multi-table synthesizer.
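Once the schema file is saved, a quick programmatic check can catch typos before training. The sketch below (assuming the file was saved as data/schema.yaml) loads it with PyYAML and flags any foreign key that points to a table not defined in the schema:

    from yaml import safe_load

    with open("data/schema.yaml", "r") as f:
        schema = safe_load(f)

    for table, config in schema.items():
        for fk in (config or {}).get("foreign_keys", []):
            if fk["parent_table"] not in schema:
                print(f"{table}: foreign key {fk['column']} references unknown table {fk['parent_table']}")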

Example Code

"""
    File to test the multitable synthesizer.
    This is an example for the Berka database using the MySQL connector
"""
import os
import pprint
from yaml import safe_load

import pandas as pd

from ydata.dataset import Dataset, MultiDataset
from ydata.metadata import MultiMetadata
from ydata.synthesizers import MultiTableSynthesizer

TRAIN = True      # train a new synthesizer; if False, load a previously saved model
GENERATE = True   # sample a synthetic database after training (or loading)
WRITE = False     # persist the generated tables to disk (see the sketch at the end of this guide)
out_path = "./multitable_model.pkl"

if __name__ == "__main__":

    # To print the dictionary structure (Optional)
    pp = pprint.PrettyPrinter(indent=4)

    """
    # Step 1: Read all your CSV files using Pandas. Create a dictionary with the following structure
                    {'table_name': table_daraframe}
    """"
    datasets = {}

    for f in os.listdir('data'):
        if f.endswith(".csv"):
            path = os.path.join('data', f)
            df = pd.read_csv(path, sep=';')
            dataset = Dataset(df)

            datasets[f.replace('.csv', '')] = dataset

    pp.pprint(datasets)

    """
    Step 2: Read your database schema from an existing yaml file
    The yaml file is expected to have the following structure:

        table1_name:
          primary_keys: [table1_PK]

        table2_name:
          primary_keys: [table2_PK]
          foreign_keys:
            - {column: table1_PK, parent_table: table1_name, parent_column: table1_PK}
    """

    with open(f"data/schema.yml", "r") as f:
        schema = safe_load(f)

    pp.pprint(schema)

    """
    Step 3: Create your MultiDataset object
    """
    database = MultiDataset(datasets=datasets,
                            schema=schema)
    print(database)

    # Validate that the data from your database was loaded correctly
    print(database['order'].head())

    """
    Step 4: Create the MultiMetadata object
    """
    m = MultiMetadata(multiset=database)
    print(m)

    """
    Step 5: Train your MultiTableSynthesizer
    """
    if TRAIN:
        synth = MultiTableSynthesizer()
        synth.fit(X=database,
                  metadata=m)
        synth.save(out_path)
    else:
        synth = MultiTableSynthesizer.load(out_path)

    if GENERATE:
        n_samples = 1
        sample = synth.sample(n_samples=n_samples)
        print(f"Generated synthetic database: {sample}")