Document Generator

`ydata.synthesizers.text.model.document.DocumentFormat`

Bases: Enum

Enum representing supported output formats for synthetic document generation.

Attributes:

Name	Type	Description
`DOCX`		Microsoft Word document format (docx)
`PDF`		Portable Document Format (pdf)
`HTML`		HyperText Markup Language format (html)
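Because `document_format` can be given as either a `DocumentFormat` member or a string, a caller may want to normalize strings up front. The sketch below uses a stand-in enum that mirrors the three documented members. The member values and the `to_format` helper are assumptions for illustration, not the library's own code.

```python
from enum import Enum
from typing import Union


class DocumentFormat(Enum):
    """Stand-in for the documented enum; the string values are assumed."""
    DOCX = "docx"
    PDF = "pdf"
    HTML = "html"


def to_format(value: Union[DocumentFormat, str]) -> DocumentFormat:
    """Hypothetical helper: accept a member or a case-insensitive member name."""
    if isinstance(value, DocumentFormat):
        return value
    try:
        return DocumentFormat[value.strip().upper()]
    except KeyError:
        raise ValueError(f"Unsupported document format: {value!r}") from None
```

With this helper, `to_format("pdf")` and `to_format(DocumentFormat.PDF)` resolve to the same member, and an unknown name raises `ValueError`.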

`ydata.synthesizers.text.model.utils.dataset_config.DatasetConfig` `dataclass`

Configuration for dataset-level document generation.

Attributes:

Name	Type	Description
`document_type`	`str`	Type of document to generate (e.g. "Invoice", "Report").
`base`	`Optional[Dict[str, str]]`	Shared parameters applied to every document. Keys map to the existing `generate()` params (audience, tone, purpose, region, language, length, topics, style_guide).
`variations`	`Optional[Dict[str, Union[Dict[str, float], List[str], List[Dict]]]]`	Fields that change across documents. Each key maps to either a weighted dict `{"value": weight}`, a uniform list `["value_a", "value_b"]`, or (for the special `"profile"` key) a list of dicts where each dict describes a realistic document profile with optional `weight`. Keys present here override the corresponding `base` value per-document.
`constraints`	`Optional[Dict[str, Union[int, List[str]]]]`	Structured directives translated into prompt text. Supported keys: `must_include` (list of strings), `min_items` (int), `max_items` (int).
`data`	`Optional[List[Dict[str, Any]]]`	Per-document payloads (JSON-serializable dicts), one row per document. Defaults to `None`. When set, the content-generation LLM is skipped and the batch size is `len(data)`.
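The three shapes accepted by `variations` are easier to tell apart side by side. The snippet below is a sketch with plain dictionaries: the sample values, the keys inside each profile dict, and the `variation_kind` helper are all hypothetical and only illustrate the shapes described in the table.

```python
from typing import Any, Dict, List, Union

Variation = Union[Dict[str, float], List[str], List[Dict[str, Any]]]

# Illustrative values only; the profile keys are assumed, not documented.
variations: Dict[str, Variation] = {
    "tone": {"formal": 0.7, "casual": 0.3},   # weighted dict: value -> weight
    "language": ["English", "Portuguese"],    # uniform list of values
    "profile": [                              # list of profile dicts
        {"audience": "Finance team", "length": "short", "weight": 2},
        {"audience": "External auditors", "length": "long"},
    ],
}


def variation_kind(key: str, spec: Variation) -> str:
    """Hypothetical helper: classify a variation entry by its shape."""
    if isinstance(spec, dict):
        return "weighted"
    if isinstance(spec, list) and all(isinstance(v, str) for v in spec):
        return "uniform"
    if key == "profile" and isinstance(spec, list) and all(isinstance(v, dict) for v in spec):
        return "profile"
    raise ValueError(f"Unrecognized variation shape for key {key!r}")
```

A list of dicts is only accepted under the `"profile"` key in this sketch, matching the description of that key as a special case.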

Validation

If `data` is non-empty, `base`, `variations`, and `constraints` are optional (each may be omitted or empty).
If `data` is absent or empty, `base` must contain at least one entry.
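The two validation rules can be sketched with a small stand-in dataclass. `ConfigSketch` is hypothetical and carries only a subset of the documented fields; raising `ValueError` is an assumption, since the source does not name the exception type used by `DatasetConfig`.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ConfigSketch:
    """Hypothetical stand-in with a subset of the documented fields."""
    document_type: str
    base: Optional[Dict[str, str]] = None
    data: Optional[List[Dict[str, Any]]] = None

    def __post_init__(self) -> None:
        # Rule 1: with per-document payloads, base may be omitted or empty.
        if self.data:
            return
        # Rule 2: without payloads, base needs at least one entry.
        if not self.base:
            raise ValueError(
                "base must contain at least one entry when data is absent or empty"
            )
```

Under this sketch, `ConfigSketch("Invoice", data=[{"id": 1}])` is accepted without `base`, while `ConfigSketch("Invoice")` is rejected.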

`ydata.synthesizers.text.model.document.DocumentGenerator`

Synthetic document generator that creates documents in various formats (DOCX, PDF, HTML) based on input specifications.

Each generation step is delegated to an `Agent` backed by the Workbench API. Agents validate every LLM response against Pydantic output models to guarantee structured, reliable outputs.

Parameters:

Name	Type	Description	Default
`document_format`	`Optional[Union[DocumentFormat, str]]`	Output format for generated documents.	`PDF`
`template_img_path`	`Optional[str]`	Path to a reference image used to derive an HTML template (optional).	`None`
`subscription_key`	`Optional[str]`	Workbench API subscription key. Required unless the installed package provides an embedded default (internal builds).	`None`
`charging_code`	`Optional[str]`	Workbench `x-kpmg-charge-code` header for API calls. When omitted, uses `WORKBENCH_CHARGING_CODE` if set, otherwise `"0"`.	`None`

`generate(document_type=None, audience=None, tone=None, purpose=None, region=None, language=None, length=None, topics=None, style_guide=None, data=None, output_dir=None, **kwargs)`

Generate documents based on input specifications.

Each call produces exactly one output document.

When `data` is provided, only `document_type` is required. The user data is serialized as the document body and the content-generation LLM is skipped. The other parameters are optional hints for HTML generation.

When `data` is `None`, the specifications are sampled and validated, and the content LLM runs before HTML generation.
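The two paths described above can be sketched as a single branch. `resolve_body` is a hypothetical helper, and `content_llm` is a plain callable standing in for the content agent; neither name comes from the library.

```python
import json
from typing import Any, Callable, Dict, Optional


def resolve_body(
    document_type: Optional[str],
    data: Optional[Dict[str, Any]],
    content_llm: Callable[[str], str],
) -> str:
    """Hypothetical helper returning the text handed on to HTML generation."""
    if not document_type:
        raise ValueError("document_type is required")
    if data is not None:
        # Data path: serialize the payload; the content LLM is never called.
        return json.dumps(data, ensure_ascii=False, indent=2)
    # LLM path: delegate to the content step (a plain callable in this sketch).
    return content_llm(document_type)
```

Passing a payload such as `{"invoice_id": "INV-001"}` returns its JSON text directly, so the callable is only invoked when `data` is `None`.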

Parameters:

Name	Type	Description	Default
`document_type`	`str \| None`	Type of document to generate. Required on both paths: when `data` is set, and on the LLM path when `data` is `None`.	`None`
`audience`	`str \| None`	Target audience for the document	`None`
`tone`	`str \| ToneCategory \| None`	Desired tone (formal, casual, etc.)	`None`
`purpose`	`str \| None`	Purpose of the document	`None`
`region`	`str \| None`	Target region/locale	`None`
`language`	`str \| None`	Language of the document	`None`
`length`	`str \| None`	Desired length of the document	`None`
`topics`	`str \| None`	Key points to cover	`None`
`style_guide`	`str \| None`	Style guide to follow	`None`
`data`	`Optional[Dict[str, Any]]`	Structured payload for a single document; when set, body text is taken from this dict (JSON-serialized), not from the content agent.	`None`
`output_dir`	`Optional[str]`	Directory to store generated documents	`None`
`**kwargs`		Reserved for forward compatibility	`{}`

Raises:

Type	Description
`ValueError`	If input validation fails or document format is unsupported

`generate_dataset(config, n_docs=None, output_dir=None, seed=None, return_metadata=True, max_workers=None)`

Generate multiple documents from a `DatasetConfig`.

This is an orchestrator that expands the config into `n_docs` per-document parameter sets (respecting weighted/uniform variation distributions) and calls `generate` once per document.
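The expansion step can be sketched with the standard library. This is a minimal illustration, not the library's implementation: the `expand` function is hypothetical, it assumes non-empty variation entries, and it leaves out the special `"profile"` key.

```python
import random
from typing import Any, Dict, List, Optional, Union

Variation = Union[Dict[str, float], List[str]]


def expand(
    base: Dict[str, str],
    variations: Dict[str, Variation],
    n_docs: int,
    seed: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Hypothetical sketch: build one parameter set per document."""
    rng = random.Random(seed)  # local RNG so a fixed seed gives repeatable output
    param_sets: List[Dict[str, Any]] = []
    for _ in range(n_docs):
        params: Dict[str, Any] = dict(base)  # start from the shared parameters
        for key, spec in variations.items():
            if isinstance(spec, dict):
                # Weighted dict: sample a value in proportion to its weight.
                values, weights = zip(*spec.items())
                params[key] = rng.choices(values, weights=weights, k=1)[0]
            else:
                # Uniform list: every value is equally likely.
                params[key] = rng.choice(spec)
        param_sets.append(params)
    return param_sets
```

Each returned dict could then be passed to `generate` as keyword arguments. Keys present in `variations` overwrite the matching `base` entry for that document, and calling `expand` twice with the same seed yields the same list.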

Parameters:

Name	Type	Description	Default
`config`	`DatasetConfig`	Dataset-level configuration describing base params, variations, and optional constraints.	required
`n_docs`	`Optional[int]`	Number of documents to generate when `config.data` is `None`. Required and must be positive in that case. When `config.data` is set, the document count is taken from `len(config.data)` and any explicit `n_docs` is ignored.	`None`
`output_dir`	`Optional[str]`	Directory to store generated documents. A temporary directory is created when not provided.	`None`
`seed`	`Optional[int]`	Optional seed for deterministic variation expansion.	`None`
`return_metadata`	`bool`	If `True` (default), return a list of metadata dicts describing the parameters used for each document.	`True`
`max_workers`	`Optional[int]`	When `config.data` is set, optionally run up to this many document generations concurrently (capped at `len(config.data)` and a small fixed limit). `None` or `1` keeps sequential execution.	`None`

Returns:

Type	Description
`Optional[List[Dict[str, Any]]]`	A list of metadata dicts (one per document) when `return_metadata` is `True`, otherwise `None`.

Raises:

Type	Description
`ValueError`	If document count is undefined or invalid (empty `config.data`, or missing/non-positive `n_docs` when `config.data` is not used).
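The rules for the document count and for `max_workers` can be sketched as two small functions. Both helpers are hypothetical. The value of `MAX_PARALLEL` is assumed, since the source only mentions "a small fixed limit", and treating an empty `config.data` as an error even when `n_docs` is given is one reading of the `ValueError` entry above.

```python
from typing import Any, Dict, List, Optional

MAX_PARALLEL = 4  # assumed value; the source only says "a small fixed limit"


def resolve_count(data: Optional[List[Dict[str, Any]]], n_docs: Optional[int]) -> int:
    """Hypothetical helper: decide how many documents to generate."""
    if data is not None:
        if not data:
            raise ValueError("config.data is empty")
        return len(data)  # an explicit n_docs is ignored on the data path
    if n_docs is None or n_docs <= 0:
        raise ValueError("n_docs must be a positive integer when config.data is not used")
    return n_docs


def resolve_workers(data: Optional[List[Dict[str, Any]]], max_workers: Optional[int]) -> int:
    """Hypothetical helper: number of concurrent generations to run."""
    # Concurrency only applies on the data path; None or 1 stays sequential.
    # Treating values below 1 as sequential is an assumption of this sketch.
    if data is None or max_workers is None or max_workers <= 1:
        return 1
    return min(max_workers, len(data), MAX_PARALLEL)
```

For ten payload rows and `max_workers=8`, this sketch would run four generations at a time under the assumed limit.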

`generate_from_template(document_type, information, output_dir)`

Generate documents by applying content to an image-derived HTML template.
