Document generation from a template image
DocumentGenerator can reproduce the
visual design of an existing document. Instead of letting the HTMLAgent invent a
layout, generate_from_template(...) derives an HTML skeleton from a reference
image (a screenshot or scan of a real document) and reuses it for every piece of
content you provide. This is the right mode when every output must match a fixed
brand, letterhead, or form layout.
How it works
- The reference image at
template_img_pathis analysed via vision to produce a reusable HTML skeleton with{{ placeholders }}. - That single skeleton is broadcast across all of your
informationitems, so the batch shares one consistent design. - Content is injected into the placeholders by the same extraction-and-assembly
step used by
generate(), so long content does not get truncated.
generate_from_template(...)
document_typeis required (for example"Invoice","Report").informationis the content to place into the template. Pass a single string or dict, or a list for a batch. List items may be plain strings or dicts (dicts are JSON-serialized automatically).template_img_pathis the path to the reference image (PNG/JPG) whose layout is reproduced.output_diris the directory where output files are written.logo_path(optional) embeds a logo into the template's brand/logo slot. Supported formats: PNG, JPG/JPEG, GIF, SVG, WEBP.
The method returns the list of output file paths.
Full example
"""
Document Generator — generate_from_template
Demonstrates ``DocumentGenerator.generate_from_template``: instead of letting the
HTMLAgent design a layout, the visual design is derived from a *reference image*
(a screenshot or scan of a real document). The extracted HTML skeleton is reused
across every ``information`` item, so all output documents share the same look
while their content differs.
Typical use cases: replicate a corporate invoice/letterhead style, or produce a
batch of documents that must all match an existing brand template.
"""
from pathlib import Path
from ydata.synthesizers.text.model.document import DocumentGenerator, DocumentFormat
if __name__ == "__main__":
# Step 1: Initialize the DocumentGenerator with the desired output format
print("Initializing Document Generator...")
generator = DocumentGenerator(
document_format=DocumentFormat.PDF # Output format: PDF, DOCX, or HTML
)
# Step 2: Point to a reference image whose layout should be reproduced.
# The image (PNG/JPG screenshot or scan) is analysed via vision to build a
# reusable HTML skeleton with {{ placeholders }}.
assets_dir = Path(__file__).resolve().parent / "assets"
template_img_path = str(assets_dir / "invoice_template.png") # Replace with your image
output_dir = "output/from_template"
# Step 4a: Single document — `information` as a structured dict.
# A well-structured JSON payload is preferred over free text: each key maps
# cleanly onto a template placeholder, so names, amounts, and line items are
# substituted reliably. (A plain string is also accepted for quick tests.)
print("\n=== Generating a single document from a template image ===")
generator.generate_from_template(
document_type="Invoice",
information={
"invoice_number": "INV-2026-00142",
"issue_date": "2026-06-25",
"due_date": "2026-07-25",
"payment_terms": "Net-30",
"seller": {
"name": "Acme Consulting LLC",
"address": "500 Market Street, Suite 1200, San Francisco, CA 94105",
"tax_id": "US-84-1234567",
"email": "billing@acme-consulting.com",
},
"bill_to": {
"name": "Globex Corporation",
"contact": "Jane Whitfield, Accounts Payable",
"address": "742 Evergreen Terrace, Springfield, IL 62704",
"email": "ap@globex.com",
},
"line_items": [
{
"description": "Senior consulting — cloud architecture review",
"quantity": 12,
"unit": "hours",
"unit_price": 150.00,
"amount": 1800.00,
},
{
"description": "Monthly support retainer",
"quantity": 1,
"unit": "month",
"unit_price": 200.00,
"amount": 200.00,
},
],
"subtotal": 2000.00,
"tax_rate": 0.08,
"tax_amount": 160.00,
"total_due": 2160.00,
"currency": "USD",
"notes": "Thank you for your business. Late payments accrue 1.5% monthly.",
},
output_dir=output_dir,
template_img_path=template_img_path,
)
# Step 4b: Batch — `information` as a list of structured dicts. The same
# image-derived template is reused for every item, so all documents share one
# consistent design while their field values differ.
print("\n=== Generating a batch of documents from the same template ===")
generator.generate_from_template(
document_type="Invoice",
information=[
{
"invoice_number": "INV-2026-00143",
"issue_date": "2026-06-25",
"due_date": "2026-07-25",
"payment_terms": "Net-30",
"bill_to": {
"name": "Vendor A Logistics",
"address": "18 Harbor Road, Seattle, WA 98101",
},
"line_items": [
{"description": "Cloud migration — phase 1", "quantity": 1,
"unit_price": 4200.00, "amount": 4200.00},
{"description": "Post-migration support (3 mo.)", "quantity": 3,
"unit_price": 350.00, "amount": 1050.00},
],
"subtotal": 5250.00,
"tax_rate": 0.095,
"tax_amount": 498.75,
"total_due": 5748.75,
"currency": "USD",
},
{
"invoice_number": "INV-2026-00144",
"issue_date": "2026-06-25",
"due_date": "2026-08-09",
"payment_terms": "Net-45",
"bill_to": {
"name": "Vendor B Financial Group",
"address": "1200 Brickell Avenue, Miami, FL 33131",
},
"line_items": [
{"description": "Security audit — external penetration test",
"quantity": 1, "unit_price": 3100.00, "amount": 3100.00},
{"description": "Remediation report & advisory call",
"quantity": 2, "unit_price": 450.00, "amount": 900.00},
],
"subtotal": 4000.00,
"tax_rate": 0.07,
"tax_amount": 280.00,
"total_due": 4280.00,
"currency": "USD",
},
],
output_dir=output_dir,
template_img_path=template_img_path,
)
# Step 4c: Optionally embed a logo into the template's brand/logo slot.
# Supported formats: PNG, JPG/JPEG, GIF, SVG, WEBP.
print("\n=== Generating from a template with a logo ===")
generator.generate_from_template(
document_type="Invoice",
information="Invoice for Globex Inc: annual software license renewal.",
output_dir=output_dir,
template_img_path=template_img_path,
logo_path="assets/globex_logo.png", # Replace with your logo
)
print(f"\nDocuments written to: {output_dir}")
Related
- Document generation from existing data
- DatasetConfig reference
- Synthetic documents overview
- API: DocumentGenerator
