Text to dataset

Overview

YData SDK supports text to dataset generation: creating tabular data and multi-table databases from natural language descriptions without an existing dataset. Using the LLM Synthesizer, you describe tables and columns in prompts; the model produces a single table (Dataset) or multiple related tables (MultiDataset) with primary and foreign keys. This is useful when you have a clear idea of the data you need but no source data to sample from.

Key Features

No Source Data: Generate tables and databases purely from prompt-defined schemas.
Prompt-Defined Schemas: Describe each table and column in natural language; the LLM fills in plausible values.
Single- and Multi-Table: Define one table or several related tables with primary and foreign keys.
Rich Column Types: Support for string, integer, float, date, datetime, category, and boolean, with optional value sets or constraints.
Relationship Control: Use foreign key prompts to control cardinality (e.g. “each customer has between 2 and 3 orders”).
Configurable Model: Use different LLM backends (e.g. openai/gpt-5-nano) via the model parameter.
Structured Output: Get a Dataset for single-table or a MultiDataset for multi-table results.

Use Cases

Rapid Schema Prototyping: Turn a textual description of a table or database into concrete sample data.
Demo and Seed Data: Generate demo or seed data for applications and dashboards.
Relational Fixtures: Create consistent relational fixtures for integration and end-to-end tests.
Synthetic Databases: Build small synthetic databases when only a description of the schema and domain is available.

Best Practices

Use Clear, Specific Prompts: Per-table and per-column prompts improve relevance and consistency of generated values.
Control Cost with sample_size: Use a small sample_size for experimentation and to limit API usage.
Specify Dtypes and Constraints: Set dtypes and, where needed, value lists or ranges to keep outputs valid.
Describe Relationships in Foreign Key Prompts: Use the foreign key prompt to state cardinality and semantics (e.g. number of related rows per parent).

Schema from a JSON file

The tables argument to fit(tables=...) is a nested dictionary that you can load from a local JSON file. Read the file with json.load() inside a with open(path) as f: block, or with json.loads(pathlib.Path(path).read_text()), so the file handle is closed after reading, then pass the result to fit(tables=...).
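As a minimal sketch of the loading step, the helper below reads a schema file and checks only that the top level is a JSON object. The name load_tables is hypothetical and is not part of the SDK.

```python
import json
from pathlib import Path


def load_tables(path):
    """Read a tables schema from a JSON file and return it as a nested dict.

    load_tables is a hypothetical helper name, not an SDK function.
    """
    # Path.read_text opens and closes the file, so no handle is left open.
    text = Path(path).read_text(encoding="utf-8")
    tables = json.loads(text)
    # The top level must map table names to table configs.
    if not isinstance(tables, dict):
        raise ValueError("top level of the schema must be a JSON object")
    return tables
```

The returned dictionary can then be passed directly as fit(tables=...).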

Structure:

Top level: Object mapping table names to table configs.
Per table:
prompt (required): string describing the table.
columns (required): object mapping column names to column configs.
primary_key (optional): string column name.
foreign_keys (optional): array of objects with column, referenced_table, and prompt.
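Per column: dtype plus at least one of prompt (string describing the column) and values (array of allowed values). A column that has values and no prompt uses dtype: "category"; prompt and values can also be combined to constrain the generated choices.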
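The structure above can be checked locally before any API call is made. The following is a rough sketch, not an SDK feature: validate_tables is a hypothetical helper, and the dtype spellings for integer, datetime, and boolean are assumed to match the names listed under Key Features.

```python
# Assumed dtype spellings, taken from the Key Features list.
ALLOWED_DTYPES = {
    "string", "integer", "float", "date", "datetime", "category", "boolean",
}


def validate_tables(tables):
    """Return a list of problems found in a tables schema (empty if none)."""
    problems = []
    for name, table in tables.items():
        # prompt and columns are required for every table.
        if not isinstance(table.get("prompt"), str):
            problems.append(f"{name}: missing 'prompt'")
        columns = table.get("columns")
        if not isinstance(columns, dict):
            problems.append(f"{name}: missing 'columns'")
            columns = {}
        for col, cfg in columns.items():
            if cfg.get("dtype") not in ALLOWED_DTYPES:
                problems.append(f"{name}.{col}: unknown or missing dtype")
            if "prompt" not in cfg and "values" not in cfg:
                problems.append(f"{name}.{col}: needs 'prompt' or 'values'")
        # primary_key is optional, but must name a declared column.
        pk = table.get("primary_key")
        if pk is not None and pk not in columns:
            problems.append(f"{name}: primary_key '{pk}' is not a column")
        # Each foreign key needs column, referenced_table, and prompt.
        for fk in table.get("foreign_keys", []):
            for key in ("column", "referenced_table", "prompt"):
                if key not in fk:
                    problems.append(f"{name}: foreign key missing '{key}'")
            ref = fk.get("referenced_table")
            if ref is not None and ref not in tables:
                problems.append(f"{name}: unknown referenced_table '{ref}'")
    return problems
```

An empty list means the schema matches the structure described here; it does not guarantee that the synthesizer accepts it.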

Example (single-table, credit card transactions):

{
  "transactions": {
    "prompt": "Credit card transactions for a financial services dataset",
    "columns": {
      "transaction_id": { "prompt": "unique identifier for the transaction", "dtype": "string" },
      "card_id": { "prompt": "identifier of the credit card", "dtype": "string" },
      "date": { "prompt": "transaction date", "dtype": "date" },
      "merchant": { "prompt": "merchant or vendor name", "dtype": "string" },
      "amount": { "prompt": "transaction amount", "dtype": "float" },
      "currency": { "prompt": "currency code of the transaction", "dtype": "category", "values": ["USD", "EUR", "GBP"] },
      "category": { "prompt": "spending category", "dtype": "category", "values": ["retail", "travel", "dining", "utilities", "other"] }
    }
  }
}
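A multi-table schema follows the same structure, with primary_key and foreign_keys added. The sketch below builds one as a Python dictionary and serializes it to JSON. The table names, column names, and prompts are illustrative, and listing the foreign key column under columns as well is an assumption rather than a documented requirement.

```python
import json

# Illustrative two-table schema: customers and their orders.
tables = {
    "customers": {
        "prompt": "Customers of an online store",
        "primary_key": "customer_id",
        "columns": {
            "customer_id": {"prompt": "unique identifier for the customer", "dtype": "string"},
            "name": {"prompt": "full name of the customer", "dtype": "string"},
            "signup_date": {"prompt": "date the customer signed up", "dtype": "date"},
        },
    },
    "orders": {
        "prompt": "Orders placed by customers",
        "primary_key": "order_id",
        "columns": {
            "order_id": {"prompt": "unique identifier for the order", "dtype": "string"},
            # Assumption: the foreign key column is also declared as a column.
            "customer_id": {"prompt": "identifier of the ordering customer", "dtype": "string"},
            "total": {"prompt": "order total", "dtype": "float"},
            "status": {"dtype": "category", "values": ["pending", "shipped", "delivered"]},
        },
        "foreign_keys": [
            {
                "column": "customer_id",
                "referenced_table": "customers",
                # The foreign key prompt states the intended cardinality.
                "prompt": "each customer has between 2 and 3 orders",
            }
        ],
    },
}

# Serialize to JSON text, e.g. for saving to a schema file.
schema_json = json.dumps(tables, indent=2)
```

Because this schema defines more than one table, sampling from it would return a MultiDataset rather than a Dataset.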

Advanced Usage

The LLM Synthesizer requires an API key for the chosen provider (e.g. OPENAI_API_KEY for OpenAI).

You pass a tables structure to fit(tables=...): each table has a prompt, a columns dict (column name to prompt, dtype, and optionally values), and for multi-table setups a primary_key and optional foreign_keys list. After fitting, call sample(sample_size=...) to get a Dataset (single table) or MultiDataset (multiple tables).

Loading in Python: read the JSON file into tables with json.load(), then call synth.fit(tables=tables), where synth is your LLM Synthesizer instance.
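The fit and sample steps can be wrapped as sketched below. Only fit(tables=...) and sample(sample_size=...) come from this page. The helper names require_api_key and generate are hypothetical, and constructing the synthesizer is left to the caller because the import path and constructor are not shown here.

```python
import os


def require_api_key(var="OPENAI_API_KEY"):
    """Fail early if the provider's API key is not set in the environment."""
    if not os.environ.get(var):
        raise RuntimeError(f"environment variable {var} is not set")


def generate(synth, tables, sample_size=10):
    """Fit a synthesizer on a prompt-defined schema and draw a small sample.

    synth is any object exposing fit(tables=...) and sample(sample_size=...),
    such as an LLM Synthesizer instance created elsewhere.
    """
    synth.fit(tables=tables)
    # A small sample_size keeps API usage low while experimenting.
    return synth.sample(sample_size=sample_size)
```

Calling require_api_key() before generate(...) surfaces a missing key before any request is attempted; the default sample_size of 10 is an arbitrary small value for experimentation.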

Related Materials

Text to Dataset getting started guide