
DatasetConfig reference

DatasetConfig is the batch configuration object for DocumentGenerator.generate_dataset(). One instance describes how many documents to create and which parameters apply to each run of generate().

There are two main modes:

| Mode | When | Content LLM |
| --- | --- | --- |
| Variation batch | data is missing or empty; you pass n_docs with optional base / variations / constraints | Runs per document (full document body from the model unless you only use layout paths). |
| Pre-built payloads | data is a non-empty list of dicts | Skipped for the document body; each dict is serialized as the content input (layout/HTML steps still apply). |

Fields

document_type (required)

String label for the kind of document (for example "Invoice", "Earnings Summary"). It is passed on every generate() call and is not part of base.

base

Optional dictionary of parameters shared by every document in the batch. Keys must align with arguments accepted by generate() for steering tone and content. The implementation treats these keys as first-class generation parameters:

  • audience
  • tone
  • purpose
  • region
  • language
  • length
  • topics
  • style_guide

Anything you put in variations or in a profile entry that is not in that set is still carried into each job; those extra keys are appended to the topics text (as key: value lines) so the LLM sees them without adding new generate() parameters.
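
As an illustration of the rule above, the following sketch folds unknown keys into the topics text. It is not the SDK's code: the helper name fold_extra_keys, the FIRST_CLASS_KEYS constant, and the newline separator are assumptions made for this example.

```python
# Hypothetical helper: mirrors the documented behavior, not the SDK source.
FIRST_CLASS_KEYS = {
    "audience", "tone", "purpose", "region",
    "language", "length", "topics", "style_guide",
}


def fold_extra_keys(params: dict) -> dict:
    """Keep first-class keys as-is and append the rest to topics as 'key: value' lines."""
    known = {k: v for k, v in params.items() if k in FIRST_CLASS_KEYS}
    extras = [f"{k}: {v}" for k, v in params.items() if k not in FIRST_CLASS_KEYS]
    if extras:
        # Assumption: existing topics text comes first, extras follow on new lines.
        parts = [known["topics"]] if known.get("topics") else []
        known["topics"] = "\n".join(parts + extras)
    return known


job = fold_extra_keys({
    "tone": "formal",
    "topics": "Consulting services",
    "vendor_type": "Consulting firm",
})
print(job["topics"])
```

Here vendor_type is not a generate() argument, so it ends up inside topics rather than as a separate parameter.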

variations

Optional. Describes how values differ per document when expanding a variation batch. For each key (except the reserved key profile), the value must be one of:

  1. Weighted dict — maps string values to positive weights, for example {"formal": 0.7, "neutral": 0.3} for tone. The batch allocator fills roughly proportional counts per value (integer rounding adjusts one bucket so the total matches n_docs).
  2. Uniform list of strings — for example "audience": ["Board of directors", "Public investors"]. Values are spread as evenly as possible across n_docs.

Variation keys are processed in sorted order so expansion is stable. The expanded value lists are shuffled using the random stream seeded by generate_dataset(..., seed=...), so the same seed and n_docs give repeatable assignments.
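
A minimal sketch of how such an expansion could work, assuming a largest-remainder allocation for weighted dicts (the SDK's exact rounding rule may differ) and a single seeded random.Random stream. All function names here are hypothetical.

```python
import math
import random


def allocate_weighted(weights: dict[str, float], n_docs: int) -> dict[str, int]:
    """Proportional counts that always sum to n_docs (largest-remainder method)."""
    total = sum(weights.values())
    exact = {value: n_docs * w / total for value, w in weights.items()}
    counts = {value: math.floor(x) for value, x in exact.items()}
    leftover = n_docs - sum(counts.values())
    # Hand the remaining documents to the values with the largest fractional part.
    by_remainder = sorted(exact, key=lambda v: exact[v] - counts[v], reverse=True)
    for value in by_remainder[:leftover]:
        counts[value] += 1
    return counts


def allocate_uniform(values: list[str], n_docs: int) -> dict[str, int]:
    """Spread n_docs as evenly as possible; early values absorb the remainder."""
    base, extra = divmod(n_docs, len(values))
    return {v: base + (1 if i < extra else 0) for i, v in enumerate(values)}


def expand_variations(variations: dict, n_docs: int, seed: int) -> list[dict]:
    """Assign one value per key to each of n_docs jobs, reproducibly."""
    rng = random.Random(seed)
    jobs = [{} for _ in range(n_docs)]
    # Sorted key order keeps the use of the random stream stable across runs.
    for key in sorted(k for k in variations if k != "profile"):
        spec = variations[key]
        if isinstance(spec, dict):
            counts = allocate_weighted(spec, n_docs)
        else:
            counts = allocate_uniform(spec, n_docs)
        column = [value for value, count in counts.items() for _ in range(count)]
        rng.shuffle(column)
        for job, value in zip(jobs, column):
            job[key] = value
    return jobs


jobs = expand_variations(
    {
        "tone": {"formal": 0.7, "neutral": 0.3},
        "audience": ["Board of directors", "Public investors"],
    },
    n_docs=10,
    seed=42,
)
```

Calling expand_variations twice with the same seed and n_docs returns the same assignments, which is the property the seed parameter is meant to provide.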

The profile key

profile is special: its value is a list of dictionaries. Each dict describes a bundle of fields (for example vendor_type, audience, topics) that apply together. Each profile may include an optional numeric weight for a weighted mix across profiles; invalid or partial weights fall back to an even split with a logged warning.
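
The weight fallback can be pictured with the sketch below. It is a hypothetical helper, not the SDK's implementation; in particular, treating a profile list with no weights at all as a silent even split is an assumption.

```python
import logging

logger = logging.getLogger(__name__)


def profile_weights(profiles: list[dict]) -> list[float]:
    """Return normalized weights, or an even split if any weight is missing or invalid."""
    raw = [p.get("weight") for p in profiles]
    valid = all(
        isinstance(w, (int, float)) and not isinstance(w, bool) and w > 0
        for w in raw
    )
    if not valid:
        # Assumption: only warn when some weights were supplied but are unusable.
        if any(w is not None for w in raw):
            logger.warning("Invalid or partial profile weights; using an even split.")
        return [1 / len(profiles)] * len(profiles)
    total = sum(raw)
    return [w / total for w in raw]


full = profile_weights([{"weight": 0.5}, {"weight": 0.3}, {"weight": 0.2}])
partial = profile_weights([{"weight": 0.5}, {"vendor_type": "Supermarket"}])
```

In this sketch, full keeps the 50/30/20 mix, while partial falls back to an even split because one profile has no weight.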

Merge order for each document:

  1. Start from the profile row (if profile is used).
  2. Apply scalar variations on top.
  3. If the same key appears in both, the scalar variation wins.

constraints

Optional. Structured hints turned into short bullet lines and merged into the prompt via topics. Supported keys:

| Key | Type | Effect |
| --- | --- | --- |
| must_include | list of strings | Adds an “Include: …” line listing the phrases. |
| min_items | int | Adds “Include at least N items”. |
| max_items | int | Adds “Include at most N items”. |

Other keys are ignored with a logged warning.
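
To make the mapping concrete, here is a small sketch that turns a constraints dict into prompt lines. The helper name and the comma separator inside the “Include:” line are assumptions; only the three supported keys and the warning on unknown keys come from the description above.

```python
import logging

logger = logging.getLogger(__name__)


def constraint_lines(constraints: dict) -> list[str]:
    """Convert supported constraint keys into short prompt lines."""
    lines = []
    for key, value in constraints.items():
        if key == "must_include":
            lines.append("Include: " + ", ".join(value))
        elif key == "min_items":
            lines.append(f"Include at least {value} items")
        elif key == "max_items":
            lines.append(f"Include at most {value} items")
        else:
            # Unsupported keys are skipped, with a warning for visibility.
            logger.warning("Ignoring unsupported constraint key: %s", key)
    return lines


lines = constraint_lines(
    {"must_include": ["total", "tax"], "min_items": 5, "color": "red"}
)
print("\n".join(lines))
```

The unsupported color key produces no line, so only the two recognized constraints reach the prompt.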

data

Optional list of dicts, one dict per output document. Values should be JSON-serializable (they are serialized when building the content row). When data is non-empty:

  • The batch size is len(data) (not n_docs).
  • base, variations, and constraints may be omitted or empty.
  • Use this for invoices, reports, or any case where the body is already structured; see Document generation from existing data.

Validation

DatasetConfig raises ValueError when base is empty and data is either not provided or empty. The error message reads:

DatasetConfig requires a non-empty base when data is not provided or is empty. Provide at least one shared parameter in base, or pass pre-generated rows in data.

When data is non-empty, an empty base is allowed.
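
The rule can be summarized with a stand-in class. DatasetConfigSketch below is a hypothetical dataclass written for this page, not the real DatasetConfig; it only reproduces the base/data check described above.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetConfigSketch:
    """Stand-in that reproduces only the documented base/data validation."""

    document_type: str
    base: dict = field(default_factory=dict)
    data: list = field(default_factory=list)

    def __post_init__(self) -> None:
        # An empty base is only acceptable when pre-built rows are supplied.
        if not self.data and not self.base:
            raise ValueError(
                "DatasetConfig requires a non-empty base when data is not "
                "provided or is empty."
            )


# Accepted: data is non-empty, so base may be omitted.
ok = DatasetConfigSketch(document_type="Invoice", data=[{"vendor": "Vendor A"}])

# Rejected: neither base nor data is provided.
try:
    DatasetConfigSketch(document_type="Invoice")
    rejected = False
except ValueError:
    rejected = True
```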

generate_dataset behavior

When data is unset or empty:

  • n_docs must be a positive integer.
  • Optional seed controls deterministic variation expansion.
  • Optional return_metadata (default True) returns one metadata dict per document (index, paths, merged parameters).
  • output_dir may be omitted to use a temporary directory.

When data is set:

  • n_docs is taken from len(config.data); an explicit n_docs is not used for sizing the batch.
  • Optional max_workers greater than 1 enables parallel generation per row, capped by the number of rows and an internal ceiling (currently 8).

Workbench subscription and API usage match other document features; see the getting started guide.

Examples

Variation batch (profiles, weighted tone, constraints):

"""
generate_dataset Example

Demonstrates how to use DatasetConfig + generate_dataset to produce
multiple documents with controlled variation across profiles, tones,
and constraints -- all in a single call.
"""

from ydata.synthesizers.text.model.document import DocumentGenerator, DocumentFormat
from ydata.synthesizers.text.model.utils.dataset_config import DatasetConfig

if __name__ == "__main__":

    generator = DocumentGenerator(document_format=DocumentFormat.PDF)

    # -----------------------------------------------------------------
    # Example 1 — Invoice dataset with weighted profiles + tone mix
    # -----------------------------------------------------------------
    # Three vendor profiles at 50/30/20 split, crossed with a 70/30
    # formal-vs-neutral tone distribution.  Constraints ensure every
    # invoice contains a tax line and a total.

    invoice_config = DatasetConfig(
        document_type="Invoice",
        base={
            "language": "English",
            "region": "North America",
            "purpose": "Detailed invoice for services or goods",
            "length": "Long",
        },
        variations={
            "profile": [
                {
                    "weight": 0.5,
                    "vendor_type": "Consulting firm",
                    "audience": "Corporate client",
                    "topics": "Consulting services, hourly rates, project milestones",
                    "style_guide": "Professional corporate invoice",
                },
                {
                    "weight": 0.3,
                    "vendor_type": "Supermarket",
                    "audience": "Retail customer",
                    "topics": "Groceries, household items, unit prices",
                    "style_guide": "Clean receipt-style invoice",
                },
                {
                    "weight": 0.2,
                    "vendor_type": "SaaS company",
                    "audience": "Business customer",
                    "topics": "Subscription fees, license seats, billing cycle",
                    "style_guide": "Modern SaaS invoice with usage breakdown",
                },
            ],
            "tone": {"formal": 0.7, "neutral": 0.3},
        },
        constraints={
            "must_include": ["total", "tax"],
            "min_items": 5,
        },
    )

    print("=== Generating 10 invoices (profiles + tone mix) ===")
    metadata = generator.generate_dataset(
        config=invoice_config,
        n_docs=10,
        output_dir="output/dataset_invoices",
        seed=42,
    )

    print(f"\nGenerated {len(metadata)} documents. Sample metadata:")
    for m in metadata[:3]:
        print(f"  #{m['index']}: vendor_type={m.get('vendor_type')}, "
              f"tone={m.get('tone')}, audience={m.get('audience')}")

    # -----------------------------------------------------------------
    # Example 2 — Simple report dataset with uniform audience rotation
    # -----------------------------------------------------------------
    # No profiles, just a uniform list of audiences and a single tone.

    report_config = DatasetConfig(
        document_type="Earnings Summary",
        base={
            "language": "English",
            "region": "Global",
            "purpose": "Quarterly earnings overview for stakeholders",
            "tone": "professional",
            "length": "Medium",
            "style_guide": "Corporate investor relations format",
        },
        variations={
            "audience": [
                "Board of directors",
                "Public investors",
                "Internal analysts",
            ],
        },
    )

    print("\n=== Generating 6 earnings summaries (audience rotation) ===")
    metadata = generator.generate_dataset(
        config=report_config,
        n_docs=6,
        output_dir="output/dataset_reports",
        seed=7,
    )

    print(f"\nGenerated {len(metadata)} documents. Audiences used:")
    for m in metadata:
        print(f"  #{m['index']}: audience={m['audience']}")

    # -----------------------------------------------------------------
    # Example 3 — Base-only config (no variations, no constraints)
    # -----------------------------------------------------------------
    # Equivalent to calling generate(n_docs=3, ...) but via DatasetConfig.

    simple_config = DatasetConfig(
        document_type="Credit Card Statement",
        base={
            "audience": "Individual cardholder",
            "tone": "formal",
            "purpose": "Monthly credit card statement",
            "region": "United States",
            "language": "English",
            "length": "Long",
            "topics": "Transactions, payment due, rewards summary",
            "style_guide": "Financial institution statement layout",
        },
    )

    print("\n=== Generating 3 credit card statements (base only) ===")
    generator.generate_dataset(
        config=simple_config,
        n_docs=3,
        output_dir="output/dataset_statements",
        return_metadata=False,
    )
    print("Done (no metadata returned).")

Pre-built content rows (DatasetConfig.data):

"""
Document Generator — pre-generated JSON content

Shows how to skip the content-generation LLM by passing structured payloads on
:class:`DatasetConfig` as ``data``: a list of dicts, one per
document. HTML template + inject steps still run (Workbench subscription key).
"""
import os

from ydata.synthesizers.text.model.document import DocumentGenerator, DocumentFormat
from ydata.synthesizers.text.model.utils.dataset_config import DatasetConfig

if __name__ == "__main__":
    generator = DocumentGenerator(
        document_format=DocumentFormat.PDF,
        subscription_key="add-workbench-key",
    )

    out_single = "output/pre_generated_single"
    os.makedirs(out_single, exist_ok=True)

    # ------------------------------------------------------------------
    # 1) generate() — document_type + data (+ output_dir)
    # ------------------------------------------------------------------
    print("=== Single document from pre-built JSON (content agent skipped) ===")
    invoice_body = {
        "vendor": "Example Corp",
        "line_items": [
            {"description": "Consulting", "amount": 1500.00},
            {"description": "Support", "amount": 200.00},
        ],
        "tax_rate": 0.08,
        "total": 1700.00,
        "payment_method": "Credit card",
        "payment_date": "2026-04-03",
        "payment_status": "Paid",
        "payment_amount": 1700.00,
        "payment_currency": "USD",
        "payment_transaction_id": "1234567890",
        "payment_transaction_date": "2026-04-03",
        "notes": "Thank you for your business.",
    }
    generator.generate(
        document_type="Invoice",
        data=invoice_body,
        output_dir=out_single,
    )
    # ------------------------------------------------------------------
    # 2) generate_dataset() — DatasetConfig with data=[{...}, {...}]
    # ------------------------------------------------------------------
    out_batch = "output/pre_generated_batch"
    os.makedirs(out_batch, exist_ok=True)

    batch_config = DatasetConfig(
        document_type="Invoice",
        data=[
            {"vendor": "Vendor A", "total": 100.0},
            {"vendor": "Vendor B", "total": 250.0},
        ],
    )

    print("\n=== Batch: config.data = list of dicts (content agent skipped) ===")
    metadata = generator.generate_dataset(
        config=batch_config,
        output_dir=out_batch,
    )
    for row in metadata:
        print(f"  index={row['index']} document_name={row['document_name']}")

Limitations

DatasetConfig controls batch parameters and prompt text, not rigid schemas: it does not enforce presentation-deck slide masters, legal clause libraries, or section-by-section contracts (for example full SOW compliance). For those outputs, treat generated files as drafts and apply template or legal review outside the SDK.