Text to Dataset (LLM synthesizer)

ydata.synthesizers.llm.LLMSynthesizer

Generates tabular or multi-table synthetic data from a prompt-based schema (no source dataset).

Use fit(tables=...) to set the schema, then sample(sample_size=...) to generate.

Example (financial services):

>>> from ydata.synthesizers import LLMSynthesizer
>>> synth = LLMSynthesizer(model="gpt-5-mini-2025-08-07-dzs-eus2")
>>> tables = {
...     "transactions": {
...         "prompt": "Credit card transactions",
...         "columns": {
...             "transaction_id": {"prompt": "unique id", "dtype": "string"},
...             "amount": {"prompt": "amount", "dtype": "float"},
...         },
...     }
... }
>>> synth.fit(tables=tables)
>>> data = synth.sample(sample_size=100)

fit(tables, existing_data=None)

Set the schema used for generation.

Parameters:

Name Type Description Default
tables dict[str, dict]

Map of table name -> {"prompt": str, "columns": {col: {"prompt", "dtype"} or {"dtype": "category", "values": [...]}}}. Optional keys per table: "primary_key", "foreign_keys" (list of {column, referenced_table, prompt}), and "table_errors" (dict with a "referential_integrity" list).

Columns may include an optional "pii" dict to guide PII generation style::

   "email": {
       "prompt": "customer email address",
       "dtype": "string",
       "pii": {
           "format": "email",
           "examples": ["john.doe@gmail.com", "alice.smith@yahoo.com"],
           "pattern": "{first}.{last}@{domain}"
       }
   }

Supported pii keys (all optional):

- format: lightweight hint (email, name, phone, company, free_text).
- examples: list of representative values the LLM should mimic in style.
- pattern: soft template string (not enforced as regex).

Columns may include an optional "errors" dict for error injection::

   "merchant_email": {
       "prompt": "merchant contact email address",
       "dtype": "string",
       "errors": {
           "format_violation": 0.08,
           "missing": 0.04,
       }
   }

Supported errors keys (all optional, float in [0, 1]):

- format_violation: fraction of rows where the LLM produces malformed values.
- missing: fraction of rows set to NaN in post-processing.

Tables may include an optional "table_errors" dict::

   "table_errors": {
       "referential_integrity": [
           {"column": "card_id", "rate": 0.03}
       ]
   }

Each entry in referential_integrity must reference a declared FK column. rate is the fraction of rows whose FK value is replaced with a non-existent parent key.

required
existing_data dict[str, DataFrame] | None

Optional. If provided, new columns are generated for these rows (e.g. enrich existing transactions).

None

Returns:

Type Description
'LLMSynthesizer'

self
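
The optional keys described for tables can be combined in a single multi-table schema. The sketch below is plain Python with no SDK calls: it assembles a two-table schema using only the keys named above, then runs a hypothetical check_schema helper (not part of ydata) that tests the two stated constraints, namely that error rates lie in [0, 1] and that each referential_integrity entry references a declared FK column. Table and column names are illustrative. The shape of "primary_key" (assumed here to be a column-name string) and whether an FK column must also be listed under "columns" are assumptions, since neither is stated above.

```python
# Illustrative two-table schema; table and column names are made up.
tables = {
    "cards": {
        "prompt": "Credit cards issued to retail customers",
        "primary_key": "card_id",  # assumption: a column-name string
        "columns": {
            "card_id": {"prompt": "unique card id", "dtype": "string"},
            "network": {"dtype": "category", "values": ["visa", "mastercard"]},
        },
    },
    "transactions": {
        "prompt": "Credit card transactions",
        "primary_key": "transaction_id",
        # Assumption: card_id is declared only here, not under "columns".
        "foreign_keys": [
            {
                "column": "card_id",
                "referenced_table": "cards",
                "prompt": "card used for the transaction",
            }
        ],
        "columns": {
            "transaction_id": {"prompt": "unique id", "dtype": "string"},
            "amount": {"prompt": "amount", "dtype": "float"},
            "merchant_email": {
                "prompt": "merchant contact email address",
                "dtype": "string",
                "pii": {"format": "email"},
                "errors": {"format_violation": 0.08, "missing": 0.04},
            },
        },
        "table_errors": {
            "referential_integrity": [{"column": "card_id", "rate": 0.03}]
        },
    },
}


def check_schema(tables):
    """Hypothetical pre-flight check (not a ydata function).

    Returns a list of human-readable problems; an empty list means
    none of the checked constraints were violated.
    """
    problems = []
    for name, spec in tables.items():
        # Column-level error rates must lie in [0, 1].
        for col, col_spec in spec.get("columns", {}).items():
            for kind, rate in col_spec.get("errors", {}).items():
                if not 0 <= rate <= 1:
                    problems.append(
                        f"{name}.{col}: {kind} rate {rate} outside [0, 1]"
                    )
        # Each referential_integrity entry must name a declared FK column.
        fk_columns = {fk["column"] for fk in spec.get("foreign_keys", [])}
        entries = spec.get("table_errors", {}).get("referential_integrity", [])
        for entry in entries:
            if entry["column"] not in fk_columns:
                problems.append(
                    f"{name}: {entry['column']} is not a declared foreign key"
                )
    return problems


problems = check_schema(tables)
```

If problems is empty, the same dict can be passed as synth.fit(tables=tables).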

sample(sample_size=4, progress_callback=None)

Generate rows from the schema set in fit().

Parameters:

Name Type Description Default
sample_size int | dict[str, int]

Rows per root table. int (same for all) or dict[table_name, int]. Default 4.

4
progress_callback Callable[..., Awaitable[None]] | None

Optional async callback for progress (e.g. table, rows, percentage).

None

Returns:

Type Description
Dataset | MultiDataset

Dataset if one table, else MultiDataset.

Raises:

Type Description
ValueError

If fit() was not called.
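
A progress callback compatible with the documented type Callable[..., Awaitable[None]] might look like the sketch below. The arguments the synthesizer actually passes are not specified beyond "table, rows, percentage", so the hypothetical report_progress accepts anything and simply records what it receives. The asyncio.run call simulates one invocation with made-up values; the sample() call is left as a comment because it needs the SDK and model access.

```python
import asyncio

received = []  # progress events collected by the callback


async def report_progress(*args, **kwargs):
    """Hypothetical async progress callback; accepts any arguments."""
    received.append((args, kwargs))
    print("progress:", args, kwargs)


# Per-table row counts keyed by table name; the name is illustrative.
sample_size = {"transactions": 100}

# With a fitted synthesizer (requires the SDK and model access):
# data = synth.sample(sample_size=sample_size, progress_callback=report_progress)

# Simulate a single invocation with made-up values to exercise the callback.
asyncio.run(report_progress(table="transactions", rows=50, percentage=50.0))
```

Accepting *args and **kwargs keeps the callback valid whichever positional or keyword arguments are supplied, at the cost of not documenting them.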