LLM Synthesizer: PII field styling

The LLMSynthesizer can generate single-table or multi-table data from a prompt-defined schema. For columns that hold emails, names, phone numbers, or similar fields, you can attach an optional pii block to each column definition. That metadata is turned into PII Generation Guidance in the model prompt, so the LLM produces new synthetic values that match the style you describe (casing, separators, domain patterns).

Not differential privacy or legal anonymization

This feature is separate from the tabular Privacy levels used by RegularSynthesizer / DifferentialPrivacyLayer. On its own, it does not guarantee non-identifiability or compliance. Treat pii as prompt guidance for realistic synthetic strings, and run your own privacy and compliance review when needed.

What the pii block does

  • format: A lightweight hint such as email, name, phone, company, or free_text.
  • examples: Representative values the model should mimic in style (most influential). The model must not copy them verbatim; it should invent new values in the same style.
  • pattern: A soft template (for example {first}.{last}@{domain}). It is not validated as a regular expression.
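
Because pattern is a soft template rather than a regular expression, the library does not check generated values against it. If you want to spot-check outputs yourself, one option is to convert the template into a loose regex. The sketch below is a minimal illustration under that assumption; template_to_regex and matches_template are hypothetical helper names, not part of the LLMSynthesizer API.

```python
import re


def template_to_regex(template: str) -> re.Pattern:
    """Convert a soft template like '{first}.{last}@{domain}' into a loose regex.

    Hypothetical helper for post-hoc checks; not part of the library.
    """
    # Split on {placeholder} tokens, keeping the tokens in the result.
    parts = re.split(r"(\{\w+\})", template)
    pieces = []
    for part in parts:
        if re.fullmatch(r"\{\w+\}", part):
            # Each placeholder matches one or more characters (non-greedy).
            pieces.append("(.+?)")
        else:
            # Everything else (separators, fixed domains) must match literally.
            pieces.append(re.escape(part))
    return re.compile("".join(pieces))


def matches_template(value: str, template: str) -> bool:
    """Return True if the whole value fits the overall shape of the template."""
    return template_to_regex(template).fullmatch(value) is not None


if __name__ == "__main__":
    template = "{first}.{last}@{domain}"
    for value in ["john.doe@gmail.com", "alice_smith@company.org"]:
        print(value, matches_template(value, template))
```

A check like this only tests the overall shape of a value. Since examples can outweigh pattern, a value that follows the examples but not the template (such as an underscore-separated address) is not necessarily an error.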

When several of these are present, the implementation generally prioritizes examples over pattern over format over the plain column prompt. No hard validation is applied to outputs; this is guidance, not a schema constraint.
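
The ordering above can be made concrete with a small sketch that lists which signals a column specification carries, strongest first. This only illustrates the stated priority. The actual prompt construction is internal to the library, and signals_by_priority is a hypothetical name introduced here.

```python
# Stated priority of the pii knobs, strongest first.
PII_PRIORITY = ("examples", "pattern", "format")


def signals_by_priority(column_spec: dict) -> list[str]:
    """List the guidance signals present on a column, strongest first.

    Hypothetical helper: it mirrors the documented ordering
    (examples > pattern > format > prompt) and nothing more.
    """
    pii = column_spec.get("pii") or {}
    # Keep only the pii knobs that are set to a non-empty value.
    ordered = [key for key in PII_PRIORITY if pii.get(key)]
    # The plain column prompt is the weakest signal.
    if column_spec.get("prompt"):
        ordered.append("prompt")
    return ordered


if __name__ == "__main__":
    email_column = {
        "prompt": "customer email address",
        "dtype": "string",
        "pii": {
            "format": "email",
            "examples": ["john.doe@gmail.com", "alice_smith@company.org"],
            "pattern": "{first}.{last}@{domain}",
        },
    }
    print(signals_by_priority(email_column))
```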

When PII guidance is omitted

If you use fit(tables=..., existing_data=...) to enrich existing rows, PII guidance is not added for columns that already appear in the existing_data frame provided for that table, because those columns are taken from your data rather than generated.
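
To see which columns would still receive guidance in that case, the rule can be sketched as a simple filter over column names. This is an illustration of the behavior described above, not the library's code. columns_receiving_guidance is a hypothetical helper, and existing_columns stands in for the column names of the frame you pass for that table.

```python
def columns_receiving_guidance(columns: dict, existing_columns) -> list[str]:
    """Return the columns that have a pii block and are not already supplied.

    Hypothetical helper: columns already present in the existing data are
    taken from that data, so no PII guidance would apply to them.
    """
    existing = set(existing_columns)
    return [
        name
        for name, spec in columns.items()
        if spec.get("pii") and name not in existing
    ]


if __name__ == "__main__":
    columns = {
        "full_name": {"dtype": "string", "pii": {"format": "name"}},
        "email": {"dtype": "string", "pii": {"format": "email"}},
        "loyalty_tier": {"dtype": "category"},
    }
    # Suppose the existing rows already contain full_name.
    print(columns_receiving_guidance(columns, ["customer_id", "full_name"]))
```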

Full example

The repository includes a runnable example that covers a single table and a multi-table setup with foreign keys:

"""
Example for the LLM synthesizer: controlling PII generation style.

Demonstrates how to use the ``pii`` field in column specifications to guide
the LLM when generating sensitive data such as emails, names, phone numbers,
and company names.

The ``pii`` block supports three optional knobs:
  - ``format``   – lightweight hint (email, name, phone, company, free_text).
  - ``examples`` – representative values the LLM should mimic in style (most important).
  - ``pattern``  – a soft template string (not enforced as regex).

Priority when multiple signals are present: examples > pattern > format > prompt.
No hard validation is applied — this is guidance, not constraints.

Provide a Workbench subscription key as the ``subscription_key`` argument of
``LLMSynthesizer``, or use an internal build with an embedded default.
"""
from ydata.synthesizers.llm.model import LLMSynthesizer


if __name__ == "__main__":

    # ------------------------------------------------------------------
    # 1. Single table with PII guidance on several columns
    # ------------------------------------------------------------------
    tables = {
        "customers": {
            "prompt": "Customers of an online electronics store in the US",
            "columns": {
                "customer_id": {
                    "prompt": "unique identifier",
                    "dtype": "string",
                },
                "full_name": {
                    "prompt": "full name of the customer",
                    "dtype": "string",
                    "pii": {
                        "format": "name",
                        "examples": ["John Doe", "Alice Smith", "Carlos Rivera"],
                    },
                },
                "email": {
                    "prompt": "customer email address",
                    "dtype": "string",
                    "pii": {
                        "format": "email",
                        "examples": [
                            "john.doe@gmail.com",
                            "alice_smith@company.org",
                            "carlos.rivera@outlook.com",
                        ],
                        "pattern": "{first}.{last}@{domain}",
                    },
                },
                "phone": {
                    "prompt": "US phone number",
                    "dtype": "string",
                    "pii": {
                        "format": "phone",
                        "examples": ["+1-555-123-4567", "+1-555-987-6543"],
                    },
                },
                "company": {
                    "prompt": "employer or company name",
                    "dtype": "string",
                    "pii": {
                        "format": "company",
                        "examples": ["Acme Corp", "Globex Industries", "Initech LLC"],
                    },
                },
                "loyalty_tier": {
                    "dtype": "category",
                    "values": ["bronze", "silver", "gold", "platinum"],
                },
            },
            "primary_key": "customer_id",
        }
    }

    synth = LLMSynthesizer(model="gpt-5-mini-2025-08-07-dzs-eus2")
    synth.fit(tables=tables)
    data = synth.sample(sample_size=10)

    print("Customers with PII-guided columns:")
    print(data.head(10))
    print()

    # ------------------------------------------------------------------
    # 2. Multi-table with PII guidance and foreign keys
    # ------------------------------------------------------------------
    tables = {
        "employees": {
            "prompt": "Employees of a mid-size tech company",
            "columns": {
                "employee_id": {
                    "prompt": "unique employee identifier",
                    "dtype": "string",
                },
                "name": {
                    "prompt": "first and last name",
                    "dtype": "string",
                    "pii": {
                        "format": "name",
                        "examples": ["Maria Garcia", "Wei Chen", "James O'Brien"],
                    },
                },
                "work_email": {
                    "prompt": "corporate email address",
                    "dtype": "string",
                    "pii": {
                        "format": "email",
                        "examples": [
                            "maria.garcia@techcorp.com",
                            "wei.chen@techcorp.com",
                        ],
                        "pattern": "{first}.{last}@techcorp.com",
                    },
                },
                "department": {
                    "dtype": "category",
                    "values": ["engineering", "sales", "marketing", "hr", "finance"],
                },
            },
            "primary_key": "employee_id",
        },
        "projects": {
            "prompt": "Internal projects at the company",
            "columns": {
                "project_id": {
                    "prompt": "unique project identifier",
                    "dtype": "string",
                },
                "lead_id": {
                    "prompt": "employee leading the project",
                    "dtype": "string",
                },
                "project_name": {
                    "prompt": "name of the project",
                    "dtype": "string",
                },
                "client_contact_email": {
                    "prompt": "external client contact email",
                    "dtype": "string",
                    "pii": {
                        "format": "email",
                        "examples": [
                            "contact@acme-corp.com",
                            "info@globex.io",
                        ],
                    },
                },
            },
            "primary_key": "project_id",
            "foreign_keys": [
                {
                    "column": "lead_id",
                    "referenced_table": "employees",
                    "prompt": "each employee leads between 1 and 2 projects",
                },
            ],
        },
    }

    synth = LLMSynthesizer(model="gpt-5-mini-2025-08-07-dzs-eus2")
    synth.fit(tables=tables)
    data = synth.sample(sample_size=4)

    print("Employees:")
    print(data["employees"].head())
    print()
    print("Projects (with PII-guided client contact emails):")
    print(data["projects"].head())