QA generator

ydata.synthesizers.text.model.qa.DocumentQAGeneration

Bases: BaseGenerator

Generates question-answer (Q&A) pairs from documents using large language models (LLMs). Inherits common LLM functionality from BaseGenerator.

Features
  • Support for multiple document formats (DOCX, TXT)
  • Batch processing of multiple documents
  • Configurable LLM selection
  • Persistent configuration saving/loading
  • PyArrow integration for efficient data handling
  • LangChain integration for document processing and chunking

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model_type | ModelType | The type of LLM to use | required |
| model_name | str | Specific model name to use | None |
| chunk_size | int | Size of text chunks for processing | 1000 |
| chunk_overlap | int | Overlap between chunks | 200 |
| document_type | DocumentType | Type of document being processed | required |
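
To make the relationship between chunk_size and chunk_overlap concrete, here is a minimal sketch of overlapping chunking in plain Python. It is not the library's implementation (the class uses LangChain for chunking), and it assumes sizes are counted in characters, which this page does not state. The function name `split_into_chunks` is hypothetical.

```python
def split_into_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Illustration only: assumes character-based sizes and a plain sliding
    window; the real splitter may prefer paragraph or sentence boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    # Each new chunk starts this many characters after the previous one,
    # so consecutive chunks share chunk_overlap characters.
    step = chunk_size - chunk_overlap

    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# With the defaults, a 2500-character text yields three chunks.
chunks = split_into_chunks("x" * 2500)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

Under these assumptions, a larger chunk_overlap repeats more context across neighbouring chunks at the cost of producing more chunks for the same document.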

generate(input_source, docs_extension='docx', num_qa_pairs=10)

Generate Q&A pairs from documents.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_source | Union[str, Table] | Either a path to a document/folder or a pyarrow Table | required |
| docs_extension | str | Extension of documents to process | 'docx' |
| num_qa_pairs | int | Number of Q&A pairs to generate | 10 |
| output_dir | | Directory to store intermediate results | required |

Returns:

| Type | Description |
|---|---|
| Table | PyArrow table (pa.Table) containing the generated Q&A pairs |

Raises:

| Type | Description |
|---|---|
| ValueError | If input_source is invalid |
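
The following sketch shows one way a path-based input_source and docs_extension could be resolved into a list of files, with a ValueError for a path that does not exist. It is an assumption about the behaviour described above, not the library's code: the helper name `collect_documents` is hypothetical, the pyarrow Table case is left out, and the page does not specify which inputs count as invalid.

```python
from pathlib import Path


def collect_documents(input_source: str, docs_extension: str = "docx") -> list[Path]:
    """Return the document paths that a path-like input_source refers to.

    Hypothetical helper: a file is returned as-is, a folder is scanned
    (non-recursively) for files with the requested extension, and anything
    else is treated as invalid.
    """
    path = Path(input_source)
    # Accept both "docx" and ".docx", and compare case-insensitively.
    suffix = "." + docs_extension.lstrip(".").lower()

    if path.is_file():
        return [path]
    if path.is_dir():
        return sorted(
            p for p in path.iterdir()
            if p.is_file() and p.suffix.lower() == suffix
        )
    # Assumed reading of "If input_source is invalid".
    raise ValueError(f"input_source is not an existing file or folder: {input_source}")
```

Called on a folder, this returns only the files matching docs_extension, which mirrors how a batch of documents might be selected before chunking and Q&A generation.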