Text to Dataset (LLM synthesizer)
ydata.synthesizers.LLMSynthesizer
Generates tabular or multi-table synthetic data from a prompt-based schema (no source dataset).
Use fit(tables=...) to set the schema, then sample(sample_size=...) to generate.
Example (financial services):

    >>> from ydata.synthesizers import LLMSynthesizer
    >>> synth = LLMSynthesizer(model="openai/gpt-5-nano")
    >>> tables = {
    ...     "transactions": {
    ...         "prompt": "Credit card transactions",
    ...         "columns": {
    ...             "transaction_id": {"prompt": "unique id", "dtype": "string"},
    ...             "amount": {"prompt": "amount", "dtype": "float"},
    ...         },
    ...     }
    ... }
    >>> synth.fit(tables=tables)
    >>> data = synth.sample(sample_size=100)
fit(tables, existing_data=None)
Set the schema used for generation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `tables` | `dict[str, dict]` | Map of table name to table spec: `{"prompt": str, "columns": {col: spec}}`, where each column spec is either `{"prompt": ..., "dtype": ...}` or `{"dtype": "category", "values": [...]}`. Optional per table: `"primary_key"` and `"foreign_keys"` (a list of `{column, referenced_table, prompt}`). | required |
| `existing_data` | `dict[str, DataFrame] \| None` | Optional. If provided, new columns are generated for these rows (e.g. to enrich existing transactions). | `None` |
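The optional per-table keys can be combined into a multi-table schema. The following is a minimal sketch, not a definitive reference: the table and column names are illustrative, representing `primary_key` as a single column-name string is an assumption (its type is not spelled out above), and whether a foreign-key column must also be listed under `"columns"` is not stated, so it is left out here. The helper `undefined_references` is hypothetical and not part of ydata; it only checks that every foreign key points at a table defined in the same dict before the schema is passed to `fit`.

```python
# Illustrative schema: names and prompts are made up for this sketch.
tables = {
    "customers": {
        "prompt": "Retail banking customers",
        "primary_key": "customer_id",  # assumption: a single column name
        "columns": {
            "customer_id": {"prompt": "unique customer id", "dtype": "string"},
            "segment": {
                "dtype": "category",
                "values": ["retail", "premium", "business"],
            },
        },
    },
    "transactions": {
        "prompt": "Credit card transactions",
        "primary_key": "transaction_id",  # assumption: a single column name
        "foreign_keys": [
            {
                "column": "customer_id",
                "referenced_table": "customers",
                "prompt": "customer who made the transaction",
            }
        ],
        "columns": {
            "transaction_id": {"prompt": "unique id", "dtype": "string"},
            "amount": {"prompt": "amount", "dtype": "float"},
        },
    },
}


def undefined_references(schema: dict[str, dict]) -> list[tuple[str, str]]:
    """Return (table, referenced_table) pairs whose target table is missing."""
    missing = []
    for name, spec in schema.items():
        for fk in spec.get("foreign_keys", []):
            if fk["referenced_table"] not in schema:
                missing.append((name, fk["referenced_table"]))
    return missing


# An empty list means every foreign key targets a table defined in the schema.
problems = undefined_references(tables)
print(problems)
```

Running a check like this before `synth.fit(tables=tables)` catches a misspelled `referenced_table` locally, without spending a model call.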
Returns:
| Type | Description |
|---|---|
| `LLMSynthesizer` | self |
sample(sample_size=4, progress_callback=None)
Generate rows from the schema set in fit().
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `sample_size` | `int \| dict[str, int]` | Number of rows per root table. An int applies the same count to every root table; a dict maps each table name to its own count. | `4` |
| `progress_callback` | `Callable[..., Awaitable[None]] \| None` | Optional async callback that receives progress information (e.g. table, rows, percentage). | `None` |
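Both parameters can be prepared ahead of the `sample` call. The sketch below assumes nothing about the callback arguments beyond the `Callable[..., Awaitable[None]]` type: it accepts any positional and keyword arguments and records them. The table names in `sizes` and the keyword names used to exercise the callback (`table`, `rows`, `percentage`) are assumptions taken from the "e.g." in the description, not a confirmed signature.

```python
import asyncio
from typing import Any

# Dict form of sample_size: one row count per table (names are illustrative).
sizes: dict[str, int] = {"customers": 50, "transactions": 500}

# Collected progress events, in the order the callback received them.
progress_log: list[dict[str, Any]] = []


async def report_progress(*args: Any, **kwargs: Any) -> None:
    """Record whatever progress information is passed in."""
    progress_log.append({"args": args, "kwargs": kwargs})


# Exercise the callback on its own; these keyword names are assumptions.
asyncio.run(report_progress(table="transactions", rows=250, percentage=50.0))
print(progress_log)

# With a fitted synthesizer, the call would look like:
# data = synth.sample(sample_size=sizes, progress_callback=report_progress)
```

Accepting `*args, **kwargs` keeps the callback tolerant of whichever arguments the library actually passes; once those are known, the signature can be narrowed.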
Returns:
| Type | Description |
|---|---|
| `Dataset \| MultiDataset` | `Dataset` if the schema has one table, otherwise `MultiDataset`. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If |
