DatasetConfig reference
DatasetConfig is the batch configuration object for DocumentGenerator.generate_dataset(). One instance describes how many documents to create and which parameters apply to each run of generate().
There are two main modes:
| Mode | When | Content LLM |
|---|---|---|
| Variation batch | data is missing or empty; you pass n_docs with optional base / variations / constraints | Runs per document (full document body from the model unless you only use layout paths). |
| Pre-built payloads | data is a non-empty list of dicts | Skipped for the document body; each dict is serialized as the content input (layout/HTML steps still apply). |
Fields
document_type (required)
String label for the kind of document (for example "Invoice", "Earnings Summary"). It is passed on every generate() call and is not part of base.
base
Optional dictionary of parameters shared by every document in the batch. Keys must align with arguments accepted by generate() for steering tone and content. The implementation treats these keys as first-class generation parameters:
- audience
- tone
- purpose
- region
- language
- length
- topics
- style_guide
Anything you put in variations or in a profile entry that is not in that set is still carried into each job; those extra keys are appended to the topics text (as key: value lines) so the LLM sees them without adding new generate() parameters.
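As a rough illustration of that split, here is a minimal sketch. It is not the SDK's code: the helper name split_params and the exact way the extra lines are joined onto existing topics text are assumptions. Only the set of first-class keys and the key: value form come from this page.

```python
# Hypothetical helper: illustrates the behaviour described above, not the SDK source.
FIRST_CLASS_KEYS = {
    "audience", "tone", "purpose", "region",
    "language", "length", "topics", "style_guide",
}


def split_params(merged: dict) -> dict:
    """Keep first-class keys as-is and fold every other key into `topics`."""
    params = {k: v for k, v in merged.items() if k in FIRST_CLASS_KEYS}
    extras = [f"{k}: {v}" for k, v in merged.items() if k not in FIRST_CLASS_KEYS]
    if extras:
        # Assumption: extra lines are appended after any existing topics text.
        parts = [params["topics"]] if params.get("topics") else []
        params["topics"] = "\n".join(parts + extras)
    return params


job = split_params({
    "tone": "formal",
    "topics": "Consulting services",
    "vendor_type": "Consulting firm",
})
print(job["topics"])
```

In this sketch vendor_type never becomes a generate() argument; it only shows up as an extra line inside topics.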
variations
Optional. Describes how values differ per document when expanding a variation batch. For each key (except the reserved key profile), the value must be one of:
- Weighted dict: maps string values to positive weights, for example {"formal": 0.7, "neutral": 0.3} for tone. The batch allocator fills roughly proportional counts per value (integer rounding adjusts one bucket so the total matches n_docs).
- Uniform list of strings: for example "audience": ["Board of directors", "Public investors"]. Values are spread as evenly as possible across n_docs.
Variation keys are processed in sorted order so expansion is stable. Arrays are shuffled with the random stream seeded by generate_dataset(..., seed=...), so the same seed and n_docs give repeatable assignments.
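The expansion described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the SDK's allocator: the function names are hypothetical, shares are rounded down with the remainder given to the largest-weight bucket (the page only says that one bucket is adjusted), and Python's random.Random stands in for the seeded random stream.

```python
import math
import random


def allocate_weighted(weights: dict, n_docs: int) -> dict:
    """Turn weights into integer counts that sum to n_docs."""
    total = sum(weights.values())
    # Round each share down; the tiny epsilon guards against float noise.
    counts = {k: math.floor(n_docs * w / total + 1e-9) for k, w in weights.items()}
    # Assumption: the leftover documents go to the largest-weight bucket.
    largest = max(weights, key=weights.get)
    counts[largest] += n_docs - sum(counts.values())
    return counts


def allocate_uniform(values: list, n_docs: int) -> dict:
    """Spread n_docs across values as evenly as possible."""
    base, extra = divmod(n_docs, len(values))
    return {v: base + (1 if i < extra else 0) for i, v in enumerate(values)}


def expand_variations(variations: dict, n_docs: int, seed: int) -> dict:
    """Build one shuffled column of values per variation key."""
    rng = random.Random(seed)
    columns = {}
    for key in sorted(variations):  # sorted order keeps the expansion stable
        spec = variations[key]
        if isinstance(spec, dict):
            counts = allocate_weighted(spec, n_docs)
        else:
            counts = allocate_uniform(spec, n_docs)
        column = [value for value, n in counts.items() for _ in range(n)]
        rng.shuffle(column)
        columns[key] = column
    return columns


columns = expand_variations(
    {
        "tone": {"formal": 0.7, "neutral": 0.3},
        "audience": ["Board of directors", "Public investors"],
    },
    n_docs=10,
    seed=42,
)
```

Calling expand_variations again with the same seed and n_docs returns the same columns, which is the repeatability property the paragraph above describes.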
The profile key
profile is special: its value is a list of dictionaries. Each dict describes a bundle of fields (for example vendor_type, audience, topics) that apply together. Each profile may include an optional numeric weight for a weighted mix across profiles; invalid or partial weights fall back to an even split with a logged warning.
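The weight fallback can be pictured with a small sketch. The function name profile_shares, the warning text, and the choice to stay silent when no profile carries a weight at all are assumptions made for illustration; the page only states that invalid or partial weights lead to an even split with a logged warning.

```python
import logging

logger = logging.getLogger(__name__)


def profile_shares(profiles: list) -> list:
    """Return one normalized share per profile row (hypothetical helper)."""
    if not profiles:
        return []
    weights = [p.get("weight") for p in profiles]
    even = [1 / len(profiles)] * len(profiles)
    if all(w is None for w in weights):
        # Assumption: no weights at all means an even split without a warning.
        return even
    usable = all(
        isinstance(w, (int, float)) and not isinstance(w, bool) and w > 0
        for w in weights
    )
    if not usable:
        logger.warning("Invalid or partial profile weights; using an even split.")
        return even
    total = sum(weights)
    return [w / total for w in weights]


# Only the first row has a weight, so this falls back to an even split.
shares = profile_shares([
    {"weight": 0.5, "vendor_type": "Consulting firm"},
    {"vendor_type": "Supermarket"},
])
```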
Merge order for each document:
- Start from the profile row (if profile is used).
- Apply scalar variations on top.
- If the same key appears in both, scalar variations wins.
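The merge order can be sketched as plain dictionary updates. This is a minimal illustration, not the SDK's implementation: merge_job is a hypothetical name, and both layering base underneath the profile row and dropping the weight entry from the merged result are assumptions.

```python
from typing import Optional


def merge_job(base: dict, profile_row: Optional[dict], scalar_variations: dict) -> dict:
    """Combine shared, profile, and per-document values for one job."""
    merged = dict(base)  # Assumption: shared base values form the bottom layer.
    if profile_row:
        # Assumption: `weight` only steers the mix and is not a job parameter.
        merged.update({k: v for k, v in profile_row.items() if k != "weight"})
    merged.update(scalar_variations)  # scalar variations win on key clashes
    return merged


job = merge_job(
    base={"language": "English"},
    profile_row={"weight": 0.5, "audience": "Corporate client", "tone": "neutral"},
    scalar_variations={"tone": "formal"},
)
print(job)
```

Here tone appears in both the profile row and the scalar variations, so the job ends up with the scalar value "formal".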
constraints
Optional. Structured hints turned into short bullet lines and merged into the prompt via topics. Supported keys:
| Key | Type | Effect |
|---|---|---|
| must_include | list of strings | Adds an “Include: …” line listing the phrases. |
| min_items | int | Adds “Include at least N items”. |
| max_items | int | Adds “Include at most N items”. |
Other keys are ignored with a logged warning.
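A rough sketch of how such constraints could turn into prompt lines is shown below. The function name constraint_lines, the comma separator for must_include, and the warning wording are assumptions; the line prefixes follow the table above.

```python
import logging

logger = logging.getLogger(__name__)


def constraint_lines(constraints: dict) -> list:
    """Translate supported constraint keys into short prompt lines."""
    lines = []
    for key, value in constraints.items():
        if key == "must_include":
            # Assumption: phrases are joined with a comma and a space.
            lines.append("Include: " + ", ".join(value))
        elif key == "min_items":
            lines.append(f"Include at least {value} items")
        elif key == "max_items":
            lines.append(f"Include at most {value} items")
        else:
            logger.warning("Ignoring unsupported constraint key: %s", key)
    return lines


lines = constraint_lines({"must_include": ["total", "tax"], "min_items": 5})
print("\n".join(lines))
```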
data
Optional list of dicts, one dict per output document. Values should be JSON-serializable (they are serialized when building the content row). When data is non-empty:
- The batch size is len(data) (not n_docs).
- base, variations, and constraints may be omitted or empty.
- Use this for invoices, reports, or any case where the body is already structured; see Document generation from existing data.
Validation
DatasetConfig raises ValueError if data is not provided or is empty and base is empty:
DatasetConfig requires a non-empty base when data is not provided or is empty. Provide at least one shared parameter in base, or pass pre-generated rows in data.
When data is non-empty, an empty base is allowed.
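The rule can be mirrored in a small stand-in class to make the two cases concrete. SketchConfig is hypothetical and is not the real DatasetConfig; only the condition and the error message are taken from this page.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchConfig:
    """Hypothetical stand-in that reproduces only the validation rule."""

    document_type: str
    base: dict = field(default_factory=dict)
    data: Optional[list] = None

    def __post_init__(self) -> None:
        # An empty base is only acceptable when pre-built rows are supplied.
        if not self.data and not self.base:
            raise ValueError(
                "DatasetConfig requires a non-empty base when data is not "
                "provided or is empty. Provide at least one shared parameter "
                "in base, or pass pre-generated rows in data."
            )


# Accepted: data is non-empty, so base may stay empty.
ok = SketchConfig(document_type="Invoice", data=[{"vendor": "Vendor A"}])

# Rejected: neither base nor data is provided.
try:
    SketchConfig(document_type="Invoice")
    rejected = False
except ValueError:
    rejected = True
```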
generate_dataset behavior
When data is unset or empty:
- n_docs must be a positive integer.
- Optional seed controls deterministic variation expansion.
- Optional return_metadata (default True) returns one metadata dict per document (index, paths, merged parameters).
- output_dir may be omitted to use a temporary directory.
When data is set:
- n_docs is taken from len(config.data); an explicit n_docs is not used for sizing the batch.
- Optional max_workers greater than 1 enables parallel generation per row, capped by the number of rows and an internal ceiling (currently 8).
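The worker cap can be sketched as a small calculation. The names effective_workers, run_rows, and render_row are hypothetical, and the use of a thread pool is an assumption; the page only states that parallelism is capped by the number of rows and by a ceiling of 8.

```python
from concurrent.futures import ThreadPoolExecutor


def effective_workers(max_workers, n_rows: int, ceiling: int = 8) -> int:
    """Cap the requested workers by the row count and the ceiling."""
    if max_workers is None or max_workers <= 1:
        return 1
    return max(1, min(max_workers, n_rows, ceiling))


def render_row(index: int, row: dict) -> dict:
    """Stand-in for rendering one pre-built row into a document."""
    return {"index": index, "vendor": row["vendor"]}


def run_rows(rows: list, max_workers: int = 1) -> list:
    """Render rows sequentially or in a capped thread pool."""
    workers = effective_workers(max_workers, len(rows))
    if workers == 1:
        return [render_row(i, row) for i, row in enumerate(rows)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so indices stay aligned.
        return list(pool.map(render_row, range(len(rows)), rows))


rows = [{"vendor": "Vendor A"}, {"vendor": "Vendor B"}]
results = run_rows(rows, max_workers=16)  # capped to 2 workers here
```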
Workbench subscription and API usage match other document features; see the getting started guide.
Examples
Variation batch (profiles, weighted tone, constraints):
"""
generate_dataset Example
Demonstrates how to use DatasetConfig + generate_dataset to produce
multiple documents with controlled variation across profiles, tones,
and constraints -- all in a single call.
"""
import os
from ydata.synthesizers.text.model.document import DocumentGenerator, DocumentFormat
from ydata.synthesizers.text.model.utils.dataset_config import DatasetConfig
if __name__ == "__main__":
    generator = DocumentGenerator(document_format=DocumentFormat.PDF)

    # -----------------------------------------------------------------
    # Example 1: Invoice dataset with weighted profiles + tone mix
    # -----------------------------------------------------------------
    # Three vendor profiles at 50/30/20 split, crossed with a 70/30
    # formal-vs-neutral tone distribution. Constraints ensure every
    # invoice contains a tax line and a total.
    invoice_config = DatasetConfig(
        document_type="Invoice",
        base={
            "language": "English",
            "region": "North America",
            "purpose": "Detailed invoice for services or goods",
            "length": "Long",
        },
        variations={
            "profile": [
                {
                    "weight": 0.5,
                    "vendor_type": "Consulting firm",
                    "audience": "Corporate client",
                    "topics": "Consulting services, hourly rates, project milestones",
                    "style_guide": "Professional corporate invoice",
                },
                {
                    "weight": 0.3,
                    "vendor_type": "Supermarket",
                    "audience": "Retail customer",
                    "topics": "Groceries, household items, unit prices",
                    "style_guide": "Clean receipt-style invoice",
                },
                {
                    "weight": 0.2,
                    "vendor_type": "SaaS company",
                    "audience": "Business customer",
                    "topics": "Subscription fees, license seats, billing cycle",
                    "style_guide": "Modern SaaS invoice with usage breakdown",
                },
            ],
            "tone": {"formal": 0.7, "neutral": 0.3},
        },
        constraints={
            "must_include": ["total", "tax"],
            "min_items": 5,
        },
    )

    print("=== Generating 10 invoices (profiles + tone mix) ===")
    metadata = generator.generate_dataset(
        config=invoice_config,
        n_docs=10,
        output_dir="output/dataset_invoices",
        seed=42,
    )
    print(f"\nGenerated {len(metadata)} documents. Sample metadata:")
    for m in metadata[:3]:
        print(f"  #{m['index']}: vendor_type={m.get('vendor_type')}, "
              f"tone={m.get('tone')}, audience={m.get('audience')}")

    # -----------------------------------------------------------------
    # Example 2: Simple report dataset with uniform audience rotation
    # -----------------------------------------------------------------
    # No profiles, just a uniform list of audiences and a single tone.
    report_config = DatasetConfig(
        document_type="Earnings Summary",
        base={
            "language": "English",
            "region": "Global",
            "purpose": "Quarterly earnings overview for stakeholders",
            "tone": "professional",
            "length": "Medium",
            "style_guide": "Corporate investor relations format",
        },
        variations={
            "audience": [
                "Board of directors",
                "Public investors",
                "Internal analysts",
            ],
        },
    )

    print("\n=== Generating 6 earnings summaries (audience rotation) ===")
    metadata = generator.generate_dataset(
        config=report_config,
        n_docs=6,
        output_dir="output/dataset_reports",
        seed=7,
    )
    print(f"\nGenerated {len(metadata)} documents. Audiences used:")
    for m in metadata:
        print(f"  #{m['index']}: audience={m['audience']}")

    # -----------------------------------------------------------------
    # Example 3: Base-only config (no variations, no constraints)
    # -----------------------------------------------------------------
    # Equivalent to calling generate(n_docs=3, ...) but via DatasetConfig.
    simple_config = DatasetConfig(
        document_type="Credit Card Statement",
        base={
            "audience": "Individual cardholder",
            "tone": "formal",
            "purpose": "Monthly credit card statement",
            "region": "United States",
            "language": "English",
            "length": "Long",
            "topics": "Transactions, payment due, rewards summary",
            "style_guide": "Financial institution statement layout",
        },
    )

    print("\n=== Generating 3 credit card statements (base only) ===")
    generator.generate_dataset(
        config=simple_config,
        n_docs=3,
        output_dir="output/dataset_statements",
        return_metadata=False,
    )
    print("Done (no metadata returned).")
Pre-built content rows (DatasetConfig.data):
"""
Document Generator — pre-generated JSON content
Shows how to skip the content-generation LLM by passing structured payloads on
:class:`DatasetConfig` as ``data``: a list of dicts, one per
document. HTML template + inject steps still run (Workbench subscription key).
"""
import os
from ydata.synthesizers.text.model.document import DocumentGenerator, DocumentFormat
from ydata.synthesizers.text.model.utils.dataset_config import DatasetConfig
if __name__ == "__main__":
    generator = DocumentGenerator(
        document_format=DocumentFormat.PDF,
        subscription_key="add-workbench-key",
    )

    out_single = "output/pre_generated_single"
    os.makedirs(out_single, exist_ok=True)

    # ------------------------------------------------------------------
    # 1) generate(): document_type + data (+ output_dir)
    # ------------------------------------------------------------------
    print("=== Single document from pre-built JSON (content agent skipped) ===")
    invoice_body = {
        "vendor": "Example Corp",
        "line_items": [
            {"description": "Consulting", "amount": 1500.00},
            {"description": "Support", "amount": 200.00},
        ],
        "tax_rate": 0.08,
        "total": 1700.00,
        "payment_method": "Credit card",
        "payment_date": "2026-04-03",
        "payment_status": "Paid",
        "payment_amount": 1700.00,
        "payment_currency": "USD",
        "payment_transaction_id": "1234567890",
        "payment_transaction_date": "2026-04-03",
        "notes": "Thank you for your business.",
    }
    generator.generate(
        document_type="Invoice",
        data=invoice_body,
        output_dir=out_single,
    )

    # ------------------------------------------------------------------
    # 2) generate_dataset(): DatasetConfig with data=[{...}, {...}]
    # ------------------------------------------------------------------
    out_batch = "output/pre_generated_batch"
    os.makedirs(out_batch, exist_ok=True)

    batch_config = DatasetConfig(
        document_type="Invoice",
        data=[
            {"vendor": "Vendor A", "total": 100.0},
            {"vendor": "Vendor B", "total": 250.0},
        ],
    )

    print("\n=== Batch: config.data = list of dicts (content agent skipped) ===")
    metadata = generator.generate_dataset(
        config=batch_config,
        output_dir=out_batch,
    )
    for row in metadata:
        print(f"  index={row['index']} document_name={row['document_name']}")
Limitations
DatasetConfig controls batch parameters and prompt text, not rigid schemas: it does not enforce presentation-deck slide masters, legal clause libraries, or section-by-section contracts (for example full SOW compliance). For those outputs, treat generated files as drafts and apply template or legal review outside the SDK.
Related
- Document generation from existing data
- Synthetic documents overview
- API: DocumentGenerator and DatasetConfig
