
Document Generator

ydata.synthesizers.text.model.document.DocumentFormat

Bases: Enum

Enum representing supported output formats for synthetic document generation.

Attributes:

Name Type Description
DOCX

Microsoft Word document format (docx)

PDF

Portable Document Format (pdf)

HTML

HyperText Markup Language format (html)

ydata.synthesizers.text.model.utils.dataset_config.DatasetConfig dataclass

Configuration for dataset-level document generation.

Attributes:

Name Type Description
document_type str

Type of document to generate (e.g. "Invoice", "Report").

base Optional[Dict[str, str]]

Shared parameters applied to every document. Keys correspond to generate() parameters (audience, tone, purpose, region, language, length, topics, style_guide).

variations Optional[Dict[str, Union[Dict[str, float], List[str], List[Dict]]]]

Fields that change across documents. Each key maps to either a weighted dict {"value": weight}, a uniform list ["value_a", "value_b"], or (for the special "profile" key) a list of dicts where each dict describes a realistic document profile with optional weight. Keys present here override the corresponding base value per-document.

constraints Optional[Dict[str, Union[int, List[str]]]]

Structured directives translated into prompt text. Supported keys: must_include (list of strings), min_items (int), max_items (int).

data Optional[List[Dict[str, Any]]]

Per-document payloads (JSON-serializable dicts), one row per document. Defaults to None. When set, the content-generation LLM is skipped and the batch size is len(data).

Validation
  • If data is non-empty, base / variations / constraints are optional (may be omitted or empty).
  • If data is absent or empty, base must contain at least one entry.
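The two validation rules above can be sketched with a small stand-in dataclass. This is only an illustration: DatasetConfigSketch is a hypothetical name, the field names and types are copied from the attribute list, and performing the check in __post_init__ is an assumption about where validation might happen.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Union

Variation = Union[Dict[str, float], List[str], List[Dict]]


@dataclass
class DatasetConfigSketch:
    """Hypothetical stand-in mirroring the documented DatasetConfig fields."""

    document_type: str
    base: Optional[Dict[str, str]] = None
    variations: Optional[Dict[str, Variation]] = None
    constraints: Optional[Dict[str, Union[int, List[str]]]] = None
    data: Optional[List[Dict[str, Any]]] = None

    def __post_init__(self) -> None:
        # Rule 1: a non-empty data list makes base/variations/constraints optional.
        if self.data:
            return
        # Rule 2: without data (None or empty), base needs at least one entry.
        if not self.base:
            raise ValueError(
                "base must contain at least one entry when data is absent or empty"
            )
```

A config with only data=[{"id": 1}] passes, while one with neither data nor base raises ValueError.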

ydata.synthesizers.text.model.document.DocumentGenerator

Synthetic document generator that creates documents in various formats (DOCX, PDF, HTML) based on input specifications.

Each generation step is delegated to an Agent backed by the Workbench API. Agents validate every LLM response against Pydantic output models to guarantee structured, reliable outputs.

Parameters:

Name Type Description Default
document_format Optional[Union[DocumentFormat, str]]

Output format for generated documents.

PDF
template_img_path Optional[str]

Path to a reference image used to derive an HTML template (optional).

None
subscription_key Optional[str]

Workbench API subscription key. Required unless the installed package provides an embedded default (internal builds).

None
charging_code Optional[str]

Workbench x-kpmg-charge-code header for API calls. When omitted, uses WORKBENCH_CHARGING_CODE if set, otherwise "0".

None
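The constructor accepts the format either as an enum member or as a string, and the charging code falls back to an environment variable and then to "0". A minimal sketch of how such inputs might be normalized follows. DocumentFormatSketch, normalize_format, and resolve_charging_code are hypothetical names, the lowercase member values are assumed from the attribute descriptions, and treating None as PDF is an assumption based on the documented default.

```python
import os
from enum import Enum
from typing import Optional, Union


class DocumentFormatSketch(Enum):
    """Hypothetical stand-in; member values are assumed, not confirmed."""

    DOCX = "docx"
    PDF = "pdf"
    HTML = "html"


def normalize_format(
    fmt: Optional[Union[DocumentFormatSketch, str]],
) -> DocumentFormatSketch:
    """Coerce an enum member or string into an enum member."""
    if fmt is None:
        # Assumption: None falls back to the documented default, PDF.
        return DocumentFormatSketch.PDF
    if isinstance(fmt, DocumentFormatSketch):
        return fmt
    try:
        return DocumentFormatSketch(fmt.strip().lower())
    except ValueError:
        raise ValueError(f"Unsupported document format: {fmt!r}") from None


def resolve_charging_code(charging_code: Optional[str] = None) -> str:
    """Explicit value first, then WORKBENCH_CHARGING_CODE, then "0"."""
    if charging_code is not None:
        return charging_code
    return os.environ.get("WORKBENCH_CHARGING_CODE", "0")
```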

generate(document_type=None, audience=None, tone=None, purpose=None, region=None, language=None, length=None, topics=None, style_guide=None, data=None, output_dir=None, **kwargs)

Generate documents based on input specifications.

Each call produces exactly one output document.

When data is provided, only document_type is required. User data is serialized as the document body; the content-generation LLM is skipped. Other parameters are optional hints for HTML generation.

When data is None, specifications are sampled and validated, and the content LLM runs before HTML generation.
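The two call paths can be illustrated with a short helper that invokes generate() once per path. This is a sketch under assumptions: generate_both_ways is a hypothetical helper, generator is expected to be a DocumentGenerator instance (any object exposing the documented keyword arguments works), and the payload and specification values are invented for illustration.

```python
from typing import Any, Dict, Optional


def generate_both_ways(generator: Any, output_dir: Optional[str] = None) -> None:
    """Call generate() once on the data path and once on the LLM path."""
    # Data path: only document_type is required; the payload becomes the
    # document body and the content-generation LLM is skipped.
    invoice: Dict[str, Any] = {"invoice_id": "INV-001", "total": 120.5}
    generator.generate(document_type="Invoice", data=invoice, output_dir=output_dir)

    # LLM path: no data, so the specification fields drive content generation.
    generator.generate(
        document_type="Report",
        audience="finance team",
        tone="formal",
        purpose="quarterly summary",
        language="English",
        output_dir=output_dir,
    )
```

Each call produces exactly one document, so this helper would write two files into output_dir.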

Parameters:

Name Type Description Default
document_type str | None

Type of document to generate. Required on both paths: when data is set, and for the LLM path when data is None.

None
audience str | None

Target audience for the document

None
tone str | ToneCategory | None

Desired tone (formal, casual, etc.)

None
purpose str | None

Purpose of the document

None
region str | None

Target region/locale

None
language str | None

Language of the document

None
length str | None

Desired length of the document

None
topics str | None

Key points to cover

None
style_guide str | None

Style guide to follow

None
data Optional[Dict[str, Any]]

Structured payload for a single document; when set, body text is taken from this dict (JSON-serialized), not from the content agent.

None
output_dir Optional[str]

Directory to store generated documents

None
**kwargs

Reserved for forward compatibility

{}

Raises:

Type Description
ValueError

If input validation fails or document format is unsupported

generate_dataset(config, n_docs=None, output_dir=None, seed=None, return_metadata=True, max_workers=None)

Generate multiple documents from a DatasetConfig.

This is an orchestrator that expands the config into n_docs per-document parameter sets (respecting weighted/uniform variation distributions) and calls generate() once per document.
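The expansion step described above can be sketched as follows. This is not the library's implementation: expand_variations is a hypothetical function, the special "profile" key is left out, and using random.Random(seed) for deterministic sampling is an assumption about how the seed might be applied.

```python
import random
from typing import Dict, List, Optional, Union

Variation = Union[Dict[str, float], List[str]]


def expand_variations(
    base: Dict[str, str],
    variations: Dict[str, Variation],
    n_docs: int,
    seed: Optional[int] = None,
) -> List[Dict[str, str]]:
    """Build n_docs parameter sets; variation keys override base values."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    docs: List[Dict[str, str]] = []
    for _ in range(n_docs):
        params = dict(base)
        for key, spec in variations.items():
            if isinstance(spec, dict):
                # Weighted dict: {"value": weight}
                values = list(spec.keys())
                weights = list(spec.values())
                params[key] = rng.choices(values, weights=weights, k=1)[0]
            else:
                # Uniform list: every value is equally likely
                params[key] = rng.choice(spec)
        docs.append(params)
    return docs
```

Each resulting dict could then be passed to generate() as keyword arguments, one call per document.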

Parameters:

Name Type Description Default
config DatasetConfig

Dataset-level configuration describing base params, variations, and optional constraints.

required
n_docs Optional[int]

Number of documents to generate when config.data is None. Required and must be positive in that case. When config.data is set, the document count is len(config.data) and any explicit n_docs is ignored.

None
output_dir Optional[str]

Directory to store generated documents. A temporary directory is created when not provided.

None
seed Optional[int]

Optional seed for deterministic variation expansion.

None
return_metadata bool

If True (default), return a list of metadata dicts describing the parameters used for each document.

True
max_workers Optional[int]

When config.data is set, optionally run up to this many document generations concurrently (capped at len(config.data) and a small fixed limit). None or 1 keeps sequential execution.

None

Returns:

Type Description
Optional[List[Dict[str, Any]]]

A list of metadata dicts (one per document) when return_metadata is True, otherwise None.

Raises:

Type Description
ValueError

If document count is undefined or invalid (empty config.data, or missing/non-positive n_docs when config.data is not used).
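The document-count and concurrency rules can be summarized in a short sketch. Both helpers are hypothetical, ASSUMED_WORKER_LIMIT is an invented value (the documentation only says "a small fixed limit"), and treating values below 1 as sequential is an assumption.

```python
from typing import Any, Dict, List, Optional

ASSUMED_WORKER_LIMIT = 8  # hypothetical; the real fixed limit is not documented


def resolve_doc_count(
    data: Optional[List[Dict[str, Any]]], n_docs: Optional[int]
) -> int:
    """Return the number of documents to generate, or raise ValueError."""
    if data is not None:
        if len(data) == 0:
            raise ValueError("config.data is empty")
        return len(data)  # any explicit n_docs is ignored on the data path
    if n_docs is None or n_docs <= 0:
        raise ValueError("n_docs must be positive when config.data is not used")
    return n_docs


def resolve_workers(max_workers: Optional[int], n_items: int) -> int:
    """Cap concurrency for the data path; None or 1 means sequential."""
    if max_workers is None or max_workers <= 1:
        return 1
    return min(max_workers, n_items, ASSUMED_WORKER_LIMIT)
```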

generate_from_template(document_type, information, output_dir)

Generate documents by applying content to an image-derived HTML template.