Document Generator
ydata.synthesizers.text.model.document.DocumentFormat
Bases: Enum
Enum representing supported output formats for synthetic document generation.
Attributes:
| Name | Type | Description |
|---|---|---|
| DOCX | | Microsoft Word document format (docx) |
| PDF | | Portable Document Format (pdf) |
| HTML | | HyperText Markup Language format (html) |
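As an illustration of how an enum like this is typically consumed, the sketch below defines a stand-in enum with the three documented members and a hypothetical helper, `normalize_format`, that accepts either a member or a string. The lowercase string values and the helper itself are assumptions made for this example, not the library's actual definitions.

```python
from enum import Enum
from typing import Union


class DocumentFormat(Enum):
    """Stand-in for the documented enum; the string values are assumed."""

    DOCX = "docx"
    PDF = "pdf"
    HTML = "html"


def normalize_format(value: Union[DocumentFormat, str]) -> DocumentFormat:
    """Hypothetical helper: accept a member or a case-insensitive string."""
    if isinstance(value, DocumentFormat):
        return value
    try:
        # Enum lookup by value raises ValueError for unknown strings.
        return DocumentFormat(value.strip().lower())
    except ValueError:
        raise ValueError(f"Unsupported document format: {value!r}") from None
```

Normalizing once at the boundary lets the rest of the code compare enum members instead of raw strings.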
ydata.synthesizers.text.model.utils.dataset_config.DatasetConfig
dataclass
Configuration for dataset-level document generation.
Attributes:
| Name | Type | Description |
|---|---|---|
| document_type | str | Type of document to generate (e.g. "Invoice", "Report"). |
| base | Optional[Dict[str, str]] | Shared parameters applied to every document. Keys map to the existing |
| variations | Optional[Dict[str, Union[Dict[str, float], List[str], List[Dict]]]] | Fields that change across documents. Each key maps to either a weighted dict |
| constraints | Optional[Dict[str, Union[int, List[str]]]] | Structured directives translated into prompt text. Supported keys: |
| data | Optional[List[Dict[str, Any]]] | Per-document payloads (JSON-serializable dicts), one row per document. When set, the content-generation LLM is skipped; batch size is |
Validation
- If data is non-empty, base/variations/constraints are optional (may be omitted or empty).
- If data is absent or empty, base must contain at least one entry.
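The two validation rules can be sketched with a small stand-in dataclass. The class name `DatasetConfigSketch`, the use of `__post_init__`, and raising `ValueError` are assumptions for illustration; the real `DatasetConfig` may validate differently and has more fields.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class DatasetConfigSketch:
    """Hypothetical stand-in with only the fields the rules mention."""

    document_type: str
    base: Optional[Dict[str, str]] = None
    data: Optional[List[Dict[str, Any]]] = None

    def __post_init__(self) -> None:
        # Rule 1: non-empty data makes base (and the other fields) optional.
        if self.data:
            return
        # Rule 2: with data absent or empty, base needs at least one entry.
        if not self.base:
            raise ValueError(
                "base must contain at least one entry when data is absent or empty"
            )
```

Note that an empty `data` list is treated the same as `None`, matching the "absent or empty" wording of the second rule.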
ydata.synthesizers.text.model.document.DocumentGenerator
Synthetic document generator that creates documents in various formats (DOCX, PDF, HTML) based on input specifications.
Each generation step is delegated to an Agent backed by the
Workbench API. Agents validate every LLM response against Pydantic
output models to guarantee structured, reliable outputs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document_format | Optional[Union[DocumentFormat, str]] | Output format for generated documents. | PDF |
| template_img_path | Optional[str] | Path to a reference image used to derive an HTML template (optional). | None |
| subscription_key | Optional[str] | Workbench API subscription key. Required unless the installed package provides an embedded default (internal builds). | None |
| charging_code | Optional[str] | Workbench | None |
generate(document_type=None, audience=None, tone=None, purpose=None, region=None, language=None, length=None, topics=None, style_guide=None, data=None, output_dir=None, **kwargs)
Generate documents based on input specifications.
Each call produces exactly one output document.
When data is provided, only document_type is required. User data
is serialized as the document body; the content-generation LLM is skipped.
Other parameters are optional hints for HTML generation.
When data is None, specifications are sampled/validated as before and
the content LLM runs before HTML generation.
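The two paths described above can be pictured with a small, self-contained sketch. The function `build_body` and the `content_llm` callable are hypothetical names introduced here; they stand in for the generator's internal logic, which the source does not show. Raising `ValueError` for a missing `document_type` is likewise an assumption.

```python
import json
from typing import Any, Callable, Dict, Optional


def build_body(
    data: Optional[Dict[str, Any]],
    document_type: Optional[str],
    content_llm: Callable[[], str],
) -> str:
    """Hypothetical sketch of how the document body is chosen."""
    if data is not None:
        # Data path: only document_type is required, and the content
        # LLM is never called; the payload itself becomes the body.
        if not document_type:
            raise ValueError("document_type is required when data is provided")
        return json.dumps(data, ensure_ascii=False)
    # No data: the content LLM produces the body before HTML generation.
    return content_llm()
```

In both cases the resulting body would then be handed to the HTML generation step, which this sketch leaves out.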
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document_type | str \| None | Type of document to generate (required if | None |
| audience | str \| None | Target audience for the document | None |
| tone | str \| ToneCategory \| None | Desired tone (formal, casual, etc.) | None |
| purpose | str \| None | Purpose of the document | None |
| region | str \| None | Target region/locale | None |
| language | str \| None | Language of the document | None |
| length | str \| None | Desired length of the document | None |
| topics | str \| None | Key points to cover | None |
| style_guide | str \| None | Style guide to follow | None |
| data | Optional[Dict[str, Any]] | Structured payload for a single document; when set, body text is taken from this dict (JSON-serialized), not from the content agent. | None |
| output_dir | Optional[str] | Directory to store generated documents | None |
| \*\*kwargs | | Reserved for forward compatibility | {} |
Raises:
| Type | Description |
|---|---|
| ValueError | If input validation fails or document format is unsupported |
generate_dataset(config, n_docs=None, output_dir=None, seed=None, return_metadata=True, max_workers=None)
Generate multiple documents from a DatasetConfig.
This method is an orchestrator: it expands the config into n_docs
per-document parameter sets (respecting weighted/uniform variation
distributions) and calls generate once per document.
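The expansion step can be sketched in plain Python. The function `expand_config` below is a hypothetical illustration, not the library's implementation: it assumes a weighted dict maps each choice to a relative weight, that a plain list is sampled uniformly, and that a seed fixes the sampling. The `List[Dict]` variation form is not covered.

```python
import random
from typing import Dict, List, Optional, Union

Variation = Union[Dict[str, float], List[str]]


def expand_config(
    base: Dict[str, str],
    variations: Dict[str, Variation],
    n_docs: int,
    seed: Optional[int] = None,
) -> List[Dict[str, str]]:
    """Hypothetical sketch: build one parameter set per document."""
    rng = random.Random(seed)  # local RNG, so a fixed seed repeats the expansion
    docs: List[Dict[str, str]] = []
    for _ in range(n_docs):
        params = dict(base)  # shared parameters are copied into every document
        for key, options in variations.items():
            if isinstance(options, dict):
                # Weighted variation: keys are choices, values are weights.
                choices = list(options)
                weights = list(options.values())
                params[key] = rng.choices(choices, weights=weights, k=1)[0]
            else:
                # Uniform variation: every listed value is equally likely.
                params[key] = rng.choice(options)
        docs.append(params)
    return docs
```

Each resulting dict would then be passed as keyword arguments to a single `generate` call.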
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | DatasetConfig | Dataset-level configuration describing base params, variations, and optional constraints. | required |
| n_docs | Optional[int] | Number of documents to generate when | None |
| output_dir | Optional[str] | Directory to store generated documents. A temporary directory is created when not provided. | None |
| seed | Optional[int] | Optional seed for deterministic variation expansion. | None |
| return_metadata | bool | If | True |
| max_workers | Optional[int] | When | None |
Returns:
| Type | Description |
|---|---|
| Optional[List[Dict[str, Any]]] | A list of metadata dicts (one per document) when return_metadata is |
Raises:
| Type | Description |
|---|---|
| ValueError | If document count is undefined or invalid (empty |
generate_from_template(document_type, information, output_dir)
Generate documents by applying content to an image-derived HTML template.
