Text to Dataset (LLM synthesizer)

`ydata.synthesizers.llm.LLMSynthesizer`

Generates tabular or multi-table synthetic data from a prompt-based schema (no source dataset).

Use `fit(tables=...)` to set the schema, then `sample(sample_size=...)` to generate data.

Example (financial services):

    >>> from ydata.synthesizers import LLMSynthesizer
    >>> synth = LLMSynthesizer(model="gpt-5-mini-2025-08-07-dzs-eus2")
    >>> tables = {
    ...     "transactions": {
    ...         "prompt": "Credit card transactions",
    ...         "columns": {
    ...             "transaction_id": {"prompt": "unique id", "dtype": "string"},
    ...             "amount": {"prompt": "amount", "dtype": "float"},
    ...         },
    ...     }
    ... }
    >>> synth.fit(tables=tables)
    >>> data = synth.sample(sample_size=100)

`fit(tables, existing_data=None)`

Set the schema used for generation.

Parameters:

Name	Type	Description	Default
`tables`	`dict[str, dict]`	Map of table name to table spec: `{"prompt": str, "columns": {col: {"prompt", "dtype"} or {"dtype": "category", "values": [...]}}}`. Optional per table: `"primary_key"`, `"foreign_keys"` (list of `{column, referenced_table, prompt}`), and `"table_errors"` (dict with a `"referential_integrity"` list). Columns may also carry optional `"pii"` and `"errors"` dicts, described below the table.	required
`existing_data`	`dict[str, DataFrame] \| None`	Optional. If provided, new columns are generated for these rows (e.g. to enrich existing transactions).	`None`

Columns may include an optional `"pii"` dict to guide PII generation style:

    "email": {
        "prompt": "customer email address",
        "dtype": "string",
        "pii": {
            "format": "email",
            "examples": ["john.doe@gmail.com", "alice.smith@yahoo.com"],
            "pattern": "{first}.{last}@{domain}"
        }
    }

Supported `pii` keys (all optional):

- `format`: lightweight hint (`email`, `name`, `phone`, `company`, `free_text`).
- `examples`: list of representative values the LLM should mimic in style.
- `pattern`: soft template string (not enforced as regex).

Columns may include an optional `"errors"` dict for error injection:

    "merchant_email": {
        "prompt": "merchant contact email address",
        "dtype": "string",
        "errors": {
            "format_violation": 0.08,
            "missing": 0.04,
        }
    }

Supported `errors` keys (all optional, float in [0, 1]):

- `format_violation`: fraction of rows where the LLM produces malformed values.
- `missing`: fraction of rows set to NaN in post-processing.

Tables may include an optional `"table_errors"` dict:

    "table_errors": {
        "referential_integrity": [
            {"column": "card_id", "rate": 0.03}
        ]
    }

Each entry in `referential_integrity` must reference a declared FK column. `rate` is the fraction of rows whose FK value is replaced with a non-existent parent key.
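The constraints on `errors` and `table_errors` can be checked before calling `fit()`. The following is a minimal sketch in plain Python, not part of the library: `check_schema` is a hypothetical helper, and it assumes that `rate` in `referential_integrity`, being a fraction, must also lie in [0, 1].

```python
from typing import Any


def check_schema(tables: dict[str, dict[str, Any]]) -> list[str]:
    """Return human-readable problems found in a `tables` schema (hypothetical helper)."""
    problems: list[str] = []
    for table_name, spec in tables.items():
        # Column-level error rates are documented as floats in [0, 1].
        for col_name, col in spec.get("columns", {}).items():
            for kind, rate in col.get("errors", {}).items():
                if not 0 <= rate <= 1:
                    problems.append(
                        f"{table_name}.{col_name}: errors[{kind!r}] = {rate} is outside [0, 1]"
                    )

        # Each referential_integrity entry must name a declared FK column.
        fk_columns = {fk["column"] for fk in spec.get("foreign_keys", [])}
        entries = spec.get("table_errors", {}).get("referential_integrity", [])
        for entry in entries:
            if entry["column"] not in fk_columns:
                problems.append(
                    f"{table_name}: referential_integrity column "
                    f"{entry['column']!r} is not a declared foreign key"
                )
            # Assumption: rate is a fraction of rows, so it should be in [0, 1].
            if not 0 <= entry["rate"] <= 1:
                problems.append(
                    f"{table_name}: referential_integrity rate {entry['rate']} is outside [0, 1]"
                )
    return problems


good_schema = {
    "transactions": {
        "prompt": "Credit card transactions",
        "foreign_keys": [
            {"column": "card_id", "referenced_table": "cards", "prompt": "card used"}
        ],
        "columns": {
            "merchant_email": {
                "prompt": "merchant contact email address",
                "dtype": "string",
                "errors": {"format_violation": 0.08, "missing": 0.04},
            },
        },
        "table_errors": {"referential_integrity": [{"column": "card_id", "rate": 0.03}]},
    }
}

bad_schema = {
    "transactions": {
        "prompt": "Credit card transactions",
        "columns": {
            "amount": {"prompt": "amount", "dtype": "float", "errors": {"missing": 1.5}},
        },
        # No foreign_keys declared, so this entry has nothing to refer to.
        "table_errors": {"referential_integrity": [{"column": "merchant_id", "rate": 0.1}]},
    }
}

for problem in check_schema(bad_schema):
    print(problem)
```

Running such a check up front surfaces schema mistakes before any LLM calls are made; how the library itself reports them is not covered here.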

Returns:

Type	Description
`'LLMSynthesizer'`	self
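Because `fit()` returns `self`, it can be chained directly with `sample()`. The sketch below pairs that with a two-table schema using the optional per-table keys. Several details are assumptions rather than documented behavior: `build_tables` and `generate` are hypothetical helper names, `"primary_key"` is assumed to take a column name as a string, and the FK column is declared only under `"foreign_keys"`.

```python
from typing import Any


def build_tables() -> dict[str, dict[str, Any]]:
    """Build an illustrative two-table schema (cards -> transactions)."""
    return {
        "cards": {
            "prompt": "Credit cards issued to retail customers",
            "primary_key": "card_id",  # assumed to be a column name
            "columns": {
                "card_id": {"prompt": "unique card identifier", "dtype": "string"},
                "network": {"dtype": "category", "values": ["visa", "mastercard", "amex"]},
            },
        },
        "transactions": {
            "prompt": "Credit card transactions",
            "primary_key": "transaction_id",
            "foreign_keys": [
                {
                    "column": "card_id",
                    "referenced_table": "cards",
                    "prompt": "card used for the purchase",
                }
            ],
            "columns": {
                "transaction_id": {"prompt": "unique id", "dtype": "string"},
                "amount": {"prompt": "amount", "dtype": "float"},
            },
        },
    }


def generate(model: str, rows: int):
    """Fit and sample in one chained expression; requires ydata to be installed."""
    # Imported lazily so the schema helper above works without the SDK.
    from ydata.synthesizers import LLMSynthesizer

    return LLMSynthesizer(model=model).fit(tables=build_tables()).sample(sample_size=rows)
```

With more than one table in the schema, `sample()` is documented to return a `MultiDataset` rather than a `Dataset`.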

`sample(sample_size=4, progress_callback=None)`

Generate rows from the schema set in `fit()`.

Parameters:

Name	Type	Description	Default
`sample_size`	`int \| dict[str, int]`	Rows per root table: an `int` (same count for every root table) or a `dict` mapping table name to row count.	`4`
`progress_callback`	`Callable[..., Awaitable[None]] \| None`	Optional async callback for progress (e.g. table, rows, percentage).	`None`
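The `progress_callback` type, `Callable[..., Awaitable[None]]`, means the callback must be an `async` function. Its exact arguments are not specified beyond the examples (table, rows, percentage), so the sketch below accepts anything and simply records it. `log_progress` and `progress_log` are hypothetical names used for illustration.

```python
import asyncio
from typing import Any

# Collected progress events, kept in memory for inspection afterwards.
progress_log: list[dict[str, Any]] = []


async def log_progress(*args: Any, **kwargs: Any) -> None:
    """Record whatever progress payload is passed in, without assuming its shape."""
    progress_log.append({"args": args, "kwargs": kwargs})


# Per-table row counts, keyed by table name (table name is illustrative).
sample_size = {"cards": 50}

# Simulate a single progress event to show the callback in isolation.
# The keyword names below mirror the documented examples and are assumptions.
asyncio.run(log_progress(table="cards", rows=10, percentage=20.0))
print(progress_log[-1]["kwargs"])
```

With a fitted synthesizer, these would be passed as `synth.sample(sample_size=sample_size, progress_callback=log_progress)`.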

Returns:

Type	Description
`Dataset \| MultiDataset`	Dataset if one table, else MultiDataset.

Raises:

Type	Description
`ValueError`	If `fit()` was not called.