Text to dataset
Overview
YData SDK supports text to dataset generation: creating tabular data and multi-table databases from natural language descriptions without an existing dataset. Using the LLM Synthesizer, you describe tables and columns in prompts; the model produces a single table (Dataset) or multiple related tables (MultiDataset) with primary and foreign keys. This is useful when you have a clear idea of the data you need but no source data to sample from.
Key Features
- No Source Data: Generate tables and databases purely from prompt-defined schemas.
- Prompt-Defined Schemas: Describe each table and column in natural language; the LLM fills in plausible values.
- Single- and Multi-Table: Define one table or several related tables with primary and foreign keys.
- Rich Column Types: Support for string, integer, float, date, datetime, category, and boolean, with optional value sets or constraints.
- Relationship Control: Use foreign key prompts to control cardinality (e.g. “each customer has between 2 and 3 orders”).
- Configurable Model: Use different LLM backends (e.g. openai/gpt-5-nano) via the model parameter.
- Structured Output: Get a Dataset for single-table or a MultiDataset for multi-table results.
Use Cases
- Rapid Schema Prototyping: Turn a textual description of a table or database into concrete sample data.
- Demo and Seed Data: Generate demo or seed data for applications and dashboards.
- Relational Fixtures: Create consistent relational fixtures for integration and end-to-end tests.
- Synthetic Databases: Build small synthetic databases when only a description of the schema and domain is available.
Best Practices
- Use Clear, Specific Prompts: Per-table and per-column prompts improve relevance and consistency of generated values.
- Control Cost with sample_size: Use a small sample_size for experimentation and to limit API usage.
- Specify Dtypes and Constraints: Set dtypes and, where needed, value lists or ranges to keep outputs valid.
- Describe Relationships in Foreign Key Prompts: Use the foreign key prompt to state cardinality and semantics (e.g. number of related rows per parent).
Schema from a JSON file
The tables argument to fit(tables=...) is a nested dictionary, so the schema can live in a local JSON file. Load it with json.load() on the opened file (or with json.loads(pathlib.Path(path).read_text())), then pass the result to fit(tables=...).
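The loading step can be sketched as a small helper. The name load_tables and the file name schema.json are hypothetical, chosen for illustration; only json and pathlib from the standard library are used.

```python
import json
from pathlib import Path


def load_tables(path):
    """Read a JSON schema file and return the nested dict for fit(tables=...)."""
    # read_text() opens and closes the file, so no handle is left open.
    text = Path(path).read_text(encoding="utf-8")
    tables = json.loads(text)
    # The top level must be an object mapping table names to table configs.
    if not isinstance(tables, dict):
        raise ValueError("top level of the schema file must be a JSON object")
    return tables
```

Typical use would be tables = load_tables("schema.json"), followed by synth.fit(tables=tables) on an already constructed synthesizer.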
Structure:
- Top level: Object mapping table names to table configs.
- Per table:
  - prompt (required): string describing the table.
  - columns (required): object mapping column names to column configs.
  - primary_key (optional): string column name.
  - foreign_keys (optional): array of objects with column, referenced_table, and prompt.
- Per column: either prompt and dtype, or dtype: "category" and values (array), and optionally both prompt and dtype with values for constrained choices.
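A quick structural check before calling fit can catch typos in a hand-written schema. The sketch below encodes only the rules listed above. The function name validate_tables is hypothetical, and this is not the SDK's own validation, which may be stricter or differ in detail.

```python
def validate_tables(tables):
    """Return a list of problems found in a tables dict (empty if none)."""
    problems = []
    for table_name, table in tables.items():
        # Required per-table keys.
        for key in ("prompt", "columns"):
            if key not in table:
                problems.append(f"{table_name}: missing required key '{key}'")

        # Each column needs a dtype plus either a prompt or category values.
        for col_name, col in table.get("columns", {}).items():
            has_prompt = "prompt" in col
            has_values = isinstance(col.get("values"), list)
            if "dtype" not in col:
                problems.append(f"{table_name}.{col_name}: missing 'dtype'")
            elif not has_prompt and not (col["dtype"] == "category" and has_values):
                problems.append(
                    f"{table_name}.{col_name}: needs a 'prompt', "
                    "or dtype 'category' with 'values'"
                )

        # Optional per-table keys.
        if "primary_key" in table and not isinstance(table["primary_key"], str):
            problems.append(f"{table_name}: 'primary_key' must be a string")
        for fk in table.get("foreign_keys", []):
            missing = [k for k in ("column", "referenced_table", "prompt") if k not in fk]
            if missing:
                problems.append(f"{table_name}: foreign key missing {missing}")
            elif fk["referenced_table"] not in tables:
                problems.append(
                    f"{table_name}: foreign key references unknown table "
                    f"'{fk['referenced_table']}'"
                )
    return problems
```

For the transactions example in this section, validate_tables returns an empty list, since every column carries both prompt and dtype.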
Example (single-table, credit card transactions):
{
  "transactions": {
    "prompt": "Credit card transactions for a financial services dataset",
    "columns": {
      "transaction_id": { "prompt": "unique identifier for the transaction", "dtype": "string" },
      "card_id": { "prompt": "identifier of the credit card", "dtype": "string" },
      "date": { "prompt": "transaction date", "dtype": "date" },
      "merchant": { "prompt": "merchant or vendor name", "dtype": "string" },
      "amount": { "prompt": "transaction amount", "dtype": "float" },
      "currency": { "prompt": "currency code of the transaction", "dtype": "category", "values": ["USD", "EUR", "GBP"] },
      "category": { "prompt": "spending category", "dtype": "category", "values": ["retail", "travel", "dining", "utilities", "other"] }
    }
  }
}
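The single-table example above extends to related tables through primary_key and foreign_keys. The following is a sketch of a two-table schema written as a Python dict. The table names, column names, and prompts are made up for illustration, and listing the foreign key column under columns as well is an assumption rather than a documented requirement.

```python
import json

# Illustrative two-table schema: customers (parent) and orders (child).
tables = {
    "customers": {
        "prompt": "Customers of an online store",
        "primary_key": "customer_id",
        "columns": {
            "customer_id": {"prompt": "unique identifier for the customer", "dtype": "string"},
            "name": {"prompt": "full name of the customer", "dtype": "string"},
            "segment": {
                "prompt": "customer segment",
                "dtype": "category",
                "values": ["consumer", "business"],
            },
        },
    },
    "orders": {
        "prompt": "Orders placed by customers",
        "primary_key": "order_id",
        "columns": {
            "order_id": {"prompt": "unique identifier for the order", "dtype": "string"},
            # Assumption: the foreign key column is also declared as a column.
            "customer_id": {"prompt": "identifier of the ordering customer", "dtype": "string"},
            "order_date": {"prompt": "date the order was placed", "dtype": "date"},
            "total": {"prompt": "order total", "dtype": "float"},
        },
        "foreign_keys": [
            {
                "column": "customer_id",
                "referenced_table": "customers",
                # The foreign key prompt states the intended cardinality.
                "prompt": "each customer has between 2 and 3 orders",
            }
        ],
    },
}

# The dict serializes directly to the JSON layout described in this section.
schema_json = json.dumps(tables, indent=2)
```

Writing schema_json to a file gives a schema that can be loaded back and passed to fit(tables=...).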
Advanced Usage
The LLM Synthesizer requires an API key for the chosen provider (e.g. OPENAI_API_KEY for OpenAI).
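Since a missing key usually surfaces only when the first request is made, it can help to fail early. A minimal sketch using only the standard library follows; the helper name require_api_key is hypothetical, and OPENAI_API_KEY is the variable named above for OpenAI.

```python
import os


def require_api_key(var_name="OPENAI_API_KEY", env=None):
    """Return the API key from the environment or raise a clear error."""
    # env can be injected for testing; by default the process environment is used.
    env = os.environ if env is None else env
    value = env.get(var_name, "").strip()
    if not value:
        raise RuntimeError(
            f"{var_name} is not set; export it before using the LLM Synthesizer"
        )
    return value
```

Calling require_api_key() at the top of a script stops it before any generation is attempted. For other providers, pass the variable name that provider expects.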
You pass a tables structure to fit(tables=...): each table has a prompt, a columns dict (column name to prompt, dtype, and optionally values), and for multi-table setups a primary_key and optional foreign_keys list. After fitting, call sample(sample_size=...) to get a Dataset (single table) or MultiDataset (multiple tables).
Loading in Python: read the JSON file into tables with json.load() on the opened file, then call synth.fit(tables=tables).
