Text to Dataset (LLM synthesizer)

ydata.synthesizers.LLMSynthesizer

Generates tabular or multi-table synthetic data from a prompt-based schema (no source dataset).

Use fit(tables=...) to set the schema, then sample(sample_size=...) to generate.

Example (financial services):

>>> from ydata.synthesizers import LLMSynthesizer
>>> synth = LLMSynthesizer(model="openai/gpt-5-nano")
>>> tables = {
...     "transactions": {
...         "prompt": "Credit card transactions",
...         "columns": {
...             "transaction_id": {"prompt": "unique id", "dtype": "string"},
...             "amount": {"prompt": "amount", "dtype": "float"},
...         },
...     }
... }
>>> synth.fit(tables=tables)
>>> data = synth.sample(sample_size=100)

fit(tables, existing_data=None)

Set the schema used for generation.

Parameters:

tables : dict[str, dict], required
    Map of table name -> {"prompt": str, "columns": {col: {"prompt", "dtype"} or {"dtype": "category", "values": [...]}}}. Optional per table: "primary_key", "foreign_keys" (list of {column, referenced_table, prompt}).

existing_data : dict[str, DataFrame] | None, default None
    Optional. If provided, new columns are generated for these rows (e.g. to enrich existing transactions).
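
The schema layout above can be sketched for two related tables as follows. This is an illustration under stated assumptions, not a confirmed usage pattern: the table and column names are made up, "primary_key" is assumed to take a column name, and whether a foreign-key column must also be listed under "columns" is not stated in this reference (it is left out here). The check_schema helper is hypothetical and not part of ydata; it only verifies the dict locally before it is passed to fit(tables=tables).

```python
# Two related tables, following the key layout described in the parameter list.
tables = {
    "customers": {
        "prompt": "Retail banking customers",
        "primary_key": "customer_id",  # assumption: value is a column name
        "columns": {
            "customer_id": {"prompt": "unique customer id", "dtype": "string"},
            # Categorical column: fixed set of allowed values instead of a prompt.
            "segment": {"dtype": "category", "values": ["retail", "premium", "business"]},
        },
    },
    "transactions": {
        "prompt": "Credit card transactions",
        "primary_key": "transaction_id",  # assumption: value is a column name
        "foreign_keys": [
            {
                "column": "customer_id",
                "referenced_table": "customers",
                "prompt": "customer who made the transaction",
            }
        ],
        "columns": {
            "transaction_id": {"prompt": "unique id", "dtype": "string"},
            "amount": {"prompt": "amount", "dtype": "float"},
        },
    },
}


def check_schema(tables):
    """Hypothetical pre-flight check (not a ydata function): list schema problems."""
    problems = []
    for name, spec in tables.items():
        for col, col_spec in spec.get("columns", {}).items():
            # A category column is only usable if it lists its allowed values.
            if col_spec.get("dtype") == "category" and not col_spec.get("values"):
                problems.append(f"{name}.{col}: category column needs 'values'")
        for fk in spec.get("foreign_keys", []):
            # Every foreign key must point at a table defined in the same dict.
            if fk["referenced_table"] not in tables:
                problems.append(f"{name}.{fk['column']}: unknown table {fk['referenced_table']!r}")
    return problems


problems = check_schema(tables)
# If problems is empty, the dict would then be passed as: synth.fit(tables=tables)
```

Running a check like this before calling fit() catches typos in referenced table names without spending any model calls.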

Returns:

LLMSynthesizer
    self

sample(sample_size=4, progress_callback=None)

Generate rows from the schema set in fit().

Parameters:

sample_size : int | dict[str, int], default 4
    Rows per root table. int (same for all) or dict[table_name, int].

progress_callback : Callable[..., Awaitable[None]] | None, default None
    Optional async callback for progress (e.g. table, rows, percentage).
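
A minimal sketch of the two optional arguments, assuming a synthesizer that has already been fitted. The exact arguments passed to the callback are not documented here beyond "e.g. table, rows, percentage", so the callback below accepts anything; the table names and the dry-run values are illustrative only.

```python
import asyncio

progress_log = []


async def on_progress(*args, **kwargs):
    """Async callback matching Callable[..., Awaitable[None]].

    It accepts any arguments because their exact names and order are an
    assumption here, not something this reference specifies.
    """
    progress_log.append((args, kwargs))


# Per-table row counts for root tables (table names are illustrative).
sample_size = {"customers": 50, "merchants": 20}

# With a fitted synthesizer, both would be passed as:
#   data = synth.sample(sample_size=sample_size, progress_callback=on_progress)

# Local dry run with made-up values, just to confirm the callback is awaitable.
asyncio.run(on_progress(table="customers", rows=50, percentage=100.0))
```

A plain (non-async) function would not satisfy the Awaitable[None] return type, so the callback is declared with async def.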

Returns:

Dataset | MultiDataset
    Dataset if one table, else MultiDataset.

Raises:

ValueError
    If fit() was not called.
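
One way to guard against the ValueError above is a small wrapper around sample(). This is a sketch: sample_or_none is a hypothetical helper, and UnfittedStub is a stand-in that only mimics the documented error so the example runs without the library or a model backend.

```python
def sample_or_none(synth, sample_size=4):
    """Return synth.sample(...), or None if it raises ValueError."""
    try:
        return synth.sample(sample_size=sample_size)
    except ValueError as exc:
        # Documented cause: fit() was not called. Other ValueErrors would
        # also land here, so the original message is printed for inspection.
        print(f"sample() failed, call fit(tables=...) first: {exc}")
        return None


class UnfittedStub:
    """Stand-in for this demo only; mimics the documented ValueError."""

    def sample(self, sample_size=4, progress_callback=None):
        raise ValueError("fit() was not called")


result = sample_or_none(UnfittedStub())
```

In real code the stub would be replaced by an LLMSynthesizer instance, and calling fit(tables=...) first avoids the error entirely.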