LLM Synthesizer: error injection
You can configure the LLMSynthesizer to produce realistic data-quality issues alongside normal rows: malformed values, missing cells, and occasional broken foreign-key references. Rates are expressed as fractions in [0, 1].
Approximate and stochastic
The format_violation rate is implemented via prompt instructions to the LLM, so the observed share of malformed values will be approximate, not exact.
The missing and referential_integrity rates are applied by randomized post-processing on the generated frames, so expect statistical variation around the configured rates.
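To get a feel for how much variation to expect, the sketch below computes the mean and standard deviation of the affected-row count for a configured rate. It assumes each row is affected independently with probability rate (a binomial model). That is an assumption made for illustration, not a description of how the synthesizer selects rows, and expected_spread is a hypothetical helper name. The model fits missing and referential_integrity best; the LLM-driven format_violation rate can deviate in ways it does not capture.

```python
import math


def expected_spread(rate: float, n_rows: int) -> tuple[float, float]:
    """Mean and standard deviation of the affected-row count.

    Assumes each row is affected independently with probability `rate`
    (binomial model). This is an illustrative assumption only.
    """
    mean = rate * n_rows
    std = math.sqrt(n_rows * rate * (1.0 - rate))
    return mean, std


mean, std = expected_spread(rate=0.04, n_rows=50)
print(f"expected about {mean:.1f} affected rows, standard deviation about {std:.1f}")
```

With 50 rows and a rate of 0.04, the expected count is 2 rows with a standard deviation above 1 row, so a small sample can plausibly show anywhere from 0 to 4 affected rows. Larger samples track the configured rate more closely.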
Column-level errors
Under each column in your tables schema, an optional errors object may include:
| Key | Effect |
|---|---|
| format_violation | The model is asked to produce roughly that fraction of malformed values (e.g. bad email shape, truncated IDs). Applied at generation time via prompt guidance. |
| missing | After generation, that fraction of cells in the column is set to NaN. Primary key columns are skipped so keys stay usable. |
You can combine both on the same column (for example messy strings and occasional nulls).
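When both keys are set on one column, the two effects interact: missing is applied after generation, so some malformed values can themselves be replaced by NaN. The arithmetic below estimates the final shares, assuming cells are nulled uniformly at random regardless of their content and that the model hits the requested format_violation rate exactly. Both are simplifying assumptions, and expected_shares is a hypothetical helper.

```python
def expected_shares(format_violation: float, missing: float) -> dict[str, float]:
    """Expected final share of each kind of cell in one column.

    Assumes nulling is independent of cell content (an assumption,
    not documented behaviour).
    """
    return {
        "missing": missing,
        "malformed": format_violation * (1.0 - missing),
        "clean": (1.0 - format_violation) * (1.0 - missing),
    }


shares = expected_shares(format_violation=0.08, missing=0.04)
for kind, share in shares.items():
    print(f"{kind}: {share:.4f}")
```

Under these assumptions the share of malformed values that survive in the final column is slightly below the configured format_violation rate.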
Table-level table_errors
Per table, an optional table_errors object supports:
| Key | Effect |
|---|---|
| referential_integrity | A list of {"column": "<fk column name>", "rate": <float>}. For each entry, the FK value is replaced with a non-existent parent key in approximately the given fraction (rate) of rows, so referential integrity breaks in a controlled way. The column must be a declared foreign key. |
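The rules on this page (rates in [0, 1], two column-level keys, referential_integrity only on declared foreign keys) can be checked before calling the synthesizer. The following is a minimal pre-flight sketch, not part of the library: check_error_config and COLUMN_ERROR_KEYS are hypothetical names, the schema layout is taken from the example on this page, and keys not documented here are simply reported as unknown even though the library may accept others.

```python
from typing import Any

# Column-level error keys documented on this page (the library may support more).
COLUMN_ERROR_KEYS = {"format_violation", "missing"}


def check_error_config(tables: dict[str, dict[str, Any]]) -> list[str]:
    """Return human-readable problems found in the error settings."""
    problems: list[str] = []

    def check_rate(where: str, rate: Any) -> None:
        is_number = isinstance(rate, (int, float)) and not isinstance(rate, bool)
        if not is_number or not 0 <= rate <= 1:
            problems.append(f"{where}: rate must be a number in [0, 1], got {rate!r}")

    for table_name, table in tables.items():
        # Column-level "errors" objects.
        for col_name, col in table.get("columns", {}).items():
            for key, rate in col.get("errors", {}).items():
                where = f"{table_name}.{col_name}.errors.{key}"
                if key not in COLUMN_ERROR_KEYS:
                    problems.append(f"{where}: unknown error type")
                else:
                    check_rate(where, rate)

        # Table-level "table_errors" object.
        fk_columns = {fk["column"] for fk in table.get("foreign_keys", [])}
        entries = table.get("table_errors", {}).get("referential_integrity", [])
        for entry in entries:
            column = entry.get("column")
            where = f"{table_name}.table_errors.referential_integrity[{column!r}]"
            if column not in fk_columns:
                problems.append(f"{where}: column is not a declared foreign key")
            check_rate(where, entry.get("rate"))

    return problems


bad_schema = {
    "transactions": {
        "columns": {"merchant_email": {"errors": {"missing": 4}}},  # 4 is not a fraction
        "foreign_keys": [{"column": "card_id", "referenced_table": "cards"}],
        "table_errors": {
            "referential_integrity": [{"column": "amount", "rate": 0.03}]  # not an FK
        },
    }
}
for problem in check_error_config(bad_schema):
    print(problem)
```

Returning a list instead of raising on the first problem lets you report every misconfigured rate in one pass.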
Full example

```python
"""
Example for the LLM synthesizer: intentional error injection.

Demonstrates how to use column-level ``errors`` and table-level ``table_errors``
to introduce realistic data-quality issues into generated data.

Supported error types:
  - ``format_violation`` (column-level): the LLM is instructed to produce a
    given fraction of malformed values (e.g. invalid emails, truncated IDs).
  - ``missing`` (column-level): a fraction of cells is set to NaN after
    generation, simulating missing data.
  - ``referential_integrity`` (table-level): a fraction of FK values is
    replaced with non-existent parent keys, simulating broken references.

Provide a Workbench subscription key as the ``subscription_key`` argument of
``LLMSynthesizer``, or use an internal build with an embedded default.
"""
from ydata.synthesizers.llm.model import LLMSynthesizer

if __name__ == "__main__":
    # ------------------------------------------------------------------
    # Multi-table with column-level and table-level errors
    # ------------------------------------------------------------------
    tables = {
        "cards": {
            "prompt": "Credit cards issued by a financial services company",
            "columns": {
                "card_id": {"prompt": "unique card identifier", "dtype": "string"},
                "holder_name": {
                    "prompt": "cardholder full name",
                    "dtype": "string",
                    "errors": {
                        "missing": 0.05,
                    },
                },
                "card_type": {
                    "dtype": "category",
                    "values": ["visa", "mastercard", "amex"],
                },
            },
            "primary_key": "card_id",
        },
        "transactions": {
            "prompt": "Credit card transactions for a financial services dataset",
            "columns": {
                "transaction_id": {
                    "prompt": "unique identifier for the transaction",
                    "dtype": "string",
                },
                "card_id": {
                    "prompt": "identifier of the credit card",
                    "dtype": "string",
                },
                "merchant_email": {
                    "prompt": "merchant contact email address",
                    "dtype": "string",
                    "errors": {
                        "format_violation": 0.08,
                        "missing": 0.04,
                    },
                },
                "amount": {"prompt": "transaction amount", "dtype": "float"},
            },
            "primary_key": "transaction_id",
            "foreign_keys": [
                {
                    "column": "card_id",
                    "referenced_table": "cards",
                    "prompt": "transactions reference existing cards",
                }
            ],
            "table_errors": {
                "referential_integrity": [
                    {
                        "column": "card_id",
                        "rate": 0.03,
                    }
                ]
            },
        },
    }

    synth = LLMSynthesizer(model="gpt-5-mini-2025-08-07-dzs-eus2")
    synth.fit(tables=tables)
    data = synth.sample(sample_size=50)

    print("Cards:")
    print(data["cards"].head(10))
    print()
    print("Transactions (with injected errors):")
    print(data["transactions"].head(20))
    print()

    # Quick sanity checks
    txn = data["transactions"]
    valid_cards = set(data["cards"]["card_id"])
    broken_refs = txn[~txn["card_id"].isin(valid_cards)]
    missing_emails = txn["merchant_email"].isna().sum()
    print(f"Broken card_id references: {len(broken_refs)} / {len(txn)}")
    print(f"Missing merchant_email values: {missing_emails} / {len(txn)}")
```
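The sanity checks above cover broken references and missing values but not format_violation. One way to estimate the observed share of malformed emails is a loose regular-expression check, sketched below. The helper malformed_share and the EMAIL_SHAPE pattern are hypothetical and not part of the library. The pattern is deliberately permissive, so what counts as malformed here is an assumption that may not match what the LLM actually produces.

```python
import re
from typing import Iterable

# Deliberately loose shape check: local@domain.tld with no whitespace.
EMAIL_SHAPE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def malformed_share(values: Iterable[object], pattern: re.Pattern = EMAIL_SHAPE) -> float:
    """Share of non-missing values that do not fully match `pattern`.

    Non-string entries (None, NaN) are treated as missing and ignored.
    """
    present = [v for v in values if isinstance(v, str)]
    if not present:
        return 0.0
    bad = sum(1 for v in present if not pattern.fullmatch(v))
    return bad / len(present)


sample = ["shop@example.com", "billing@example.org", "not-an-email", None, "a@b", float("nan")]
print(f"Malformed share among non-missing values: {malformed_share(sample):.2f}")
```

Because the function only iterates over its input, it can also be called on a column from the example, for instance malformed_share(data["transactions"]["merchant_email"]). With only 50 rows, expect the result to differ noticeably from the configured 0.08.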
Related
