
LLM Synthesizer: error injection

You can configure the LLMSynthesizer to produce realistic data-quality issues alongside normal rows: malformed values, missing cells, and occasional broken foreign-key references. Rates are expressed as fractions in [0, 1].

Approximate and stochastic

  • format_violation is implemented via prompt instructions to the LLM. The observed share of bad values will be approximate, not exact.
  • missing and referential_integrity use randomized post-processing on the generated frames. Expect statistical variation around the configured rates.
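
To get a feel for how much variation to expect, the sketch below computes the mean and standard deviation of the number of affected rows. It assumes each row is affected independently with probability equal to the configured rate (a binomial model). The page does not state how the post-processing actually draws rows, so treat this as a rough guide only; `expected_spread` is a hypothetical helper, not part of the library.

```python
import math


def expected_spread(n_rows: int, rate: float) -> tuple[float, float]:
    """Return (mean, standard deviation) of the affected-row count.

    Assumption: each row is hit independently with probability `rate`.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    mean = n_rows * rate
    std = math.sqrt(n_rows * rate * (1.0 - rate))
    return mean, std


# 50 rows with a 3% referential_integrity rate, as in the full example.
mean, std = expected_spread(n_rows=50, rate=0.03)
print(f"expected affected rows: {mean:.2f} +/- {std:.2f}")
```

Under this assumption, small samples can easily show zero or several affected rows for the same configured rate, so compare observed and configured rates on larger samples.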

Column-level errors

Under each column in your tables schema, an optional errors object may include:

| Key | Effect |
| --- | --- |
| format_violation | The model is asked to produce roughly that fraction of malformed values (e.g. bad email shape, truncated IDs). Applied at generation time via prompt guidance. |
| missing | After generation, that fraction of cells in the column is set to NaN. Primary key columns are skipped so keys stay usable. |

You can combine both on the same column (for example messy strings and occasional nulls).
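
The behaviour described for missing can be pictured with the following minimal sketch. It is not the library's implementation: `inject_missing` is a hypothetical helper that works on plain lists of dicts, and it assumes each cell is blanked independently with probability equal to the rate.

```python
import math
import random


def inject_missing(rows, column, rate, primary_key=None, seed=None):
    """Return copies of `rows` with roughly `rate` of `column` set to NaN.

    Hypothetical helper: each cell is blanked independently with
    probability `rate`; the primary key column is left untouched.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    copies = [dict(row) for row in rows]
    if column == primary_key:
        return copies  # skip primary keys so they stay usable
    rng = random.Random(seed)
    for row in copies:
        if rng.random() < rate:
            row[column] = math.nan
    return copies


rows = [{"card_id": f"C{i:03d}", "holder_name": f"Holder {i}"} for i in range(1000)]
noisy = inject_missing(rows, "holder_name", rate=0.05, primary_key="card_id", seed=0)
n_missing = sum(1 for r in noisy if isinstance(r["holder_name"], float))
print(f"missing holder_name: {n_missing} / {len(noisy)}")
```

The observed count lands near 5% of the rows but rarely on it exactly, which matches the statistical variation noted above.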

Table-level table_errors

Per table, an optional table_errors object supports:

| Key | Effect |
| --- | --- |
| referential_integrity | A list of {"column": "<fk column name>", "rate": <float>} entries. For each entry, approximately that fraction of rows has its FK value replaced with a non-existent parent key, so referential integrity breaks in a controlled way. The column must be a declared foreign key. |
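
The constraints stated on this page (rates in [0, 1], and referential_integrity columns being declared foreign keys) can be checked before calling the synthesizer. The sketch below is one way to do that on the schema dict; `validate_error_config` is a hypothetical helper, and whether the library performs similar validation itself is not stated here.

```python
def validate_error_config(tables: dict) -> list[str]:
    """Collect human-readable problems found in errors / table_errors."""
    problems = []

    def check_rate(where, rate):
        is_number = isinstance(rate, (int, float)) and not isinstance(rate, bool)
        if not is_number or not 0 <= rate <= 1:
            problems.append(f"{where}: rate {rate!r} is not a number in [0, 1]")

    for table_name, table in tables.items():
        # Column-level errors: every configured rate must be a fraction.
        for col_name, col in table.get("columns", {}).items():
            for kind, rate in col.get("errors", {}).items():
                check_rate(f"{table_name}.{col_name}.{kind}", rate)

        # Table-level errors: the column must be a declared foreign key.
        declared_fks = {fk["column"] for fk in table.get("foreign_keys", [])}
        entries = table.get("table_errors", {}).get("referential_integrity", [])
        for entry in entries:
            column = entry.get("column")
            where = f"{table_name}.table_errors.referential_integrity[{column!r}]"
            check_rate(where, entry.get("rate"))
            if column not in declared_fks:
                problems.append(f"{where}: column is not a declared foreign key")
    return problems


# Deliberately broken schema fragment: a rate above 1 and a non-FK column.
bad_tables = {
    "transactions": {
        "columns": {"merchant_email": {"dtype": "string", "errors": {"missing": 1.5}}},
        "foreign_keys": [{"column": "card_id", "referenced_table": "cards"}],
        "table_errors": {"referential_integrity": [{"column": "amount", "rate": 0.03}]},
    }
}
problems = validate_error_config(bad_tables)
for problem in problems:
    print(problem)
```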

Full example

"""
Example for the LLM synthesizer: intentional error injection.

Demonstrates how to use column-level ``errors`` and table-level ``table_errors``
to introduce realistic data-quality issues into generated data.

Supported error types:
  - ``format_violation`` (column-level): the LLM is instructed to produce
    roughly the given fraction of malformed values (e.g. invalid emails,
    truncated IDs).
  - ``missing`` (column-level): a fraction of cells is set to NaN after
    generation, simulating missing data.
  - ``referential_integrity`` (table-level): a fraction of FK values is
    replaced with non-existent parent keys, simulating broken references.

Provide a Workbench subscription key as the ``subscription_key`` argument of
``LLMSynthesizer``, or use an internal build with an embedded default.
"""
from ydata.synthesizers.llm.model import LLMSynthesizer


if __name__ == "__main__":

    # ------------------------------------------------------------------
    # Multi-table with column-level and table-level errors
    # ------------------------------------------------------------------
    tables = {
        "cards": {
            "prompt": "Credit cards issued by a financial services company",
            "columns": {
                "card_id": {"prompt": "unique card identifier", "dtype": "string"},
                "holder_name": {
                    "prompt": "cardholder full name",
                    "dtype": "string",
                    "errors": {
                        "missing": 0.05,
                    },
                },
                "card_type": {
                    "dtype": "category",
                    "values": ["visa", "mastercard", "amex"],
                },
            },
            "primary_key": "card_id",
        },
        "transactions": {
            "prompt": "Credit card transactions for a financial services dataset",
            "columns": {
                "transaction_id": {
                    "prompt": "unique identifier for the transaction",
                    "dtype": "string",
                },
                "card_id": {
                    "prompt": "identifier of the credit card",
                    "dtype": "string",
                },
                "merchant_email": {
                    "prompt": "merchant contact email address",
                    "dtype": "string",
                    "errors": {
                        "format_violation": 0.08,
                        "missing": 0.04,
                    },
                },
                "amount": {"prompt": "transaction amount", "dtype": "float"},
            },
            "primary_key": "transaction_id",
            "foreign_keys": [
                {
                    "column": "card_id",
                    "referenced_table": "cards",
                    "prompt": "transactions reference existing cards",
                }
            ],
            "table_errors": {
                "referential_integrity": [
                    {
                        "column": "card_id",
                        "rate": 0.03,
                    }
                ]
            },
        },
    }

    # Also pass subscription_key="..." here unless your build embeds a default.
    synth = LLMSynthesizer(model="gpt-5-mini-2025-08-07-dzs-eus2")
    synth.fit(tables=tables)
    data = synth.sample(sample_size=50)

    print("Cards:")
    print(data["cards"].head(10))
    print()
    print("Transactions (with injected errors):")
    print(data["transactions"].head(20))
    print()

    # Quick sanity checks
    txn = data["transactions"]
    valid_cards = set(data["cards"]["card_id"])
    broken_refs = txn[~txn["card_id"].isin(valid_cards)]
    missing_emails = txn["merchant_email"].isna().sum()

    print(f"Broken card_id references: {len(broken_refs)} / {len(txn)}")
    print(f"Missing merchant_email values: {missing_emails} / {len(txn)}")
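
The sanity checks above cover broken references and missing values, but not format_violation. A rough way to estimate the share of malformed emails is sketched below. The regular expression is a deliberately loose assumption about what counts as a well-formed email, and `malformed_share` is a hypothetical helper, so the result is only an approximation of what the LLM was asked to produce.

```python
import re

# Deliberately loose pattern: one "@", no whitespace, a dot in the domain.
EMAIL_SHAPE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def malformed_share(values) -> float:
    """Fraction of non-missing values that do not look like an email."""
    # Missing cells (None or NaN) are not strings, so they are skipped.
    present = [v for v in values if isinstance(v, str)]
    if not present:
        return 0.0
    bad = sum(1 for v in present if not EMAIL_SHAPE.match(v))
    return bad / len(present)


sample = ["sales@acme.com", "billing@shop.io", "no-at-sign.example.com", None, "x@y"]
print(f"malformed share: {malformed_share(sample):.2f}")
```

With the full example, the same function could be applied to `data["transactions"]["merchant_email"]` and compared against the configured 0.08, keeping in mind that 50 rows is a small sample.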