Document generation from existing data

DocumentGenerator can build PDF, DOCX, or HTML in two ways:

  1. LLM-written body: the user provides inputs that guide the generation of the document content and format.
  2. Your structured payload: the user passes a JSON-serializable dict (or a list of dicts) as the document body, and the provided data is injected into the generated document. This is especially useful for documents such as invoices or SOWs, where structured content must be preserved.

Single document: generate(..., data=...)

  • document_type is required and must be non-empty when data is set (for example "Invoice", "Report").
  • data is a single dict: your business payload (nested structures are fine as long as they are JSON-serializable when exported).
  • Other parameters (audience, tone, purpose, …) are optional hints for HTML generation.
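
The two requirements above can be checked locally before calling generate(..., data=...). The following is a minimal sketch using only the standard library; check_payload is a hypothetical helper written for illustration and is not part of the library.

```python
import json


def check_payload(document_type, data):
    """Validate inputs for generate(..., data=...) before calling it.

    Hypothetical helper (not part of the library): it only mirrors the
    rules listed above.
    """
    if not isinstance(document_type, str) or not document_type.strip():
        raise ValueError("document_type must be a non-empty string when data is set")
    if not isinstance(data, dict):
        raise TypeError("data must be a single dict for generate()")
    try:
        # json.dumps raises TypeError on values such as datetime or set
        json.dumps(data)
    except (TypeError, ValueError) as exc:
        raise ValueError(f"data is not JSON-serializable: {exc}") from exc
    return data


payload = {
    "vendor": "Example Corp",
    "line_items": [{"description": "Consulting", "amount": 1500.0}],
}
check_payload("Invoice", payload)  # returns the payload unchanged if it is valid
```

Failing early this way keeps serialization problems out of the generation step, where they may be harder to trace back to a specific field.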

Batch: DatasetConfig and generate_dataset

For a full reference (base, variations, profile, constraints, and the pre-built data mode), see the DatasetConfig reference.

In the existing-data mode, DatasetConfig.data is a list of dicts, one row per document. generate_dataset(config=..., output_dir=...) calls generate once per row with that row as data.

  • If config.data is set, the number of documents is len(config.data); an explicit n_docs is ignored in that mode.
  • Setting the optional max_workers to a value greater than 1 generates multiple documents concurrently (bounded by the implementation).
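
As a mental model only, the "one call per row" behavior can be pictured with the sketch below. It does not use or reproduce the library's implementation: fake_generate and run_batch are hypothetical stand-ins, and the real scheduling, ordering, and worker limits may differ.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_generate(document_type, data):
    """Stand-in for DocumentGenerator.generate; returns a label instead of a file."""
    return f"{document_type}:{data['vendor']}"


def run_batch(document_type, rows, max_workers=1):
    """Call fake_generate once per row; the number of results equals len(rows)."""
    if max_workers <= 1:
        return [fake_generate(document_type, row) for row in rows]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order even when rows finish out of order
        return list(pool.map(lambda row: fake_generate(document_type, row), rows))


rows = [{"vendor": "Vendor A"}, {"vendor": "Vendor B"}, {"vendor": "Vendor C"}]
results = run_batch("Invoice", rows, max_workers=2)
print(len(results))  # 3, one result per row
```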

Tabular data and documents

If your source is tabular synthetic data, convert rows to dicts (for example DataFrame.to_dict("records")) and pass them through DatasetConfig.data or per-call data=. This is separate from LLMSynthesizer.fit(..., existing_data=...), which enriches tables with new columns for existing rows. Use that API for tabular-only workflows.
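
Rows taken from a table can carry values such as dates or decimals that the json module rejects. The sketch below shows one way to normalize a list of row dicts into plain JSON types before passing them as data. It uses only the standard library; to_json_safe_records and _json_default are hypothetical helpers, and the choice to turn dates into ISO strings and decimals into floats is an assumption you may want to change.

```python
import json
from datetime import date, datetime
from decimal import Decimal


def _json_default(value):
    """Fallback for values the json module cannot encode on its own."""
    if isinstance(value, (date, datetime)):
        return value.isoformat()
    if isinstance(value, Decimal):
        return float(value)
    raise TypeError(f"unsupported type: {type(value).__name__}")


def to_json_safe_records(records):
    """Round-trip rows through JSON so every value is a plain JSON type."""
    return json.loads(json.dumps(records, default=_json_default))


# Rows shaped like the output of DataFrame.to_dict("records")
records = [
    {"vendor": "Vendor A", "total": Decimal("100.50"), "issued": date(2026, 4, 3)},
    {"vendor": "Vendor B", "total": Decimal("250.00"), "issued": date(2026, 4, 4)},
]
safe = to_json_safe_records(records)
print(safe[0])  # {'vendor': 'Vendor A', 'total': 100.5, 'issued': '2026-04-03'}
```

The resulting list can then be used as DatasetConfig.data, or each element passed individually as data= to generate.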

Full example

"""
Document Generator — pre-generated JSON content

Shows how to skip the content-generation LLM by passing structured payloads as
``data``: a single dict to ``generate()``, or a list of dicts (one per document)
on :class:`DatasetConfig`. The HTML template and injection steps still run, so a
Workbench subscription key is still required.
"""
import os

from ydata.synthesizers.text.model.document import DocumentGenerator, DocumentFormat
from ydata.synthesizers.text.model.utils.dataset_config import DatasetConfig

if __name__ == "__main__":
    generator = DocumentGenerator(
        document_format=DocumentFormat.PDF,
        subscription_key="add-workbench-key",
    )

    out_single = "output/pre_generated_single"
    os.makedirs(out_single, exist_ok=True)

    # ------------------------------------------------------------------
    # 1) generate() — document_type + data (+ output_dir)
    # ------------------------------------------------------------------
    print("=== Single document from pre-built JSON (content agent skipped) ===")
    invoice_body = {
        "vendor": "Example Corp",
        "line_items": [
            {"description": "Consulting", "amount": 1500.00},
            {"description": "Support", "amount": 200.00},
        ],
        "tax_rate": 0.08,
        "total": 1836.00,  # 1700.00 subtotal plus 8% tax
        "payment_method": "Credit card",
        "payment_date": "2026-04-03",
        "payment_status": "Paid",
        "payment_amount": 1836.00,
        "payment_currency": "USD",
        "payment_transaction_id": "1234567890",
        "payment_transaction_date": "2026-04-03",
        "notes": "Thank you for your business.",
    }
    generator.generate(
        document_type="Invoice",
        data=invoice_body,
        output_dir=out_single,
    )
    # ------------------------------------------------------------------
    # 2) generate_dataset() — DatasetConfig with data=[{...}, {...}]
    # ------------------------------------------------------------------
    out_batch = "output/pre_generated_batch"
    os.makedirs(out_batch, exist_ok=True)

    batch_config = DatasetConfig(
        document_type="Invoice",
        data=[
            {"vendor": "Vendor A", "total": 100.0},
            {"vendor": "Vendor B", "total": 250.0},
        ],
    )

    print("\n=== Batch: config.data = list of dicts (content agent skipped) ===")
    metadata = generator.generate_dataset(
        config=batch_config,
        output_dir=out_batch,
    )
    for row in metadata:
        print(f"  index={row['index']} document_name={row['document_name']}")