QA generator
ydata.synthesizers.text.model.qa.DocumentQAGeneration
Bases: BaseGenerator
Generates question-answer (Q&A) pairs from documents using large language models (LLMs). Common LLM functionality is inherited from BaseGenerator.
Features
- Support for multiple document formats (DOCX, TXT)
- Batch processing of multiple documents
- Configurable LLM selection
- Persistent configuration saving/loading
- PyArrow integration for efficient data handling
- LangChain integration for document processing and chunking
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model_type | ModelType | The type of LLM to use | required |
model_name | str | Specific model name to use | None |
chunk_size | int | Size of text chunks for processing | 1000 |
chunk_overlap | int | Overlap between chunks | 200 |
document_type | DocumentType | Type of document being processed | required |
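To make chunk_size and chunk_overlap concrete, the following is a minimal sketch of splitting text into overlapping character windows. It is an illustration only: the feature list attributes chunking to LangChain, whose splitters may break on separators rather than at fixed offsets, and `chunk_text` is a hypothetical helper that is not part of ydata.

```python
from __future__ import annotations


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Hypothetical helper: split text into fixed-size, overlapping character windows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each new chunk starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "x" * 2600
    for i, chunk in enumerate(chunk_text(sample)):
        print(i, len(chunk))
```

In this sketch, the defaults place each chunk 800 characters after the previous one, so consecutive chunks share 200 characters.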
generate(input_source, docs_extension='docx', num_qa_pairs=10)
Generate Q&A pairs from documents.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_source | Union[str, Table] | Either a path to a document/folder or a pyarrow Table | required |
docs_extension | str | Extension of documents to process | 'docx' |
num_qa_pairs | int | Number of Q&A pairs to generate | 10 |
output_dir | | Directory to store intermediate results | required |
Returns:
Type | Description |
---|---|
Table | pa.Table: PyArrow table containing the generated Q&A pairs |
Raises:
Type | Description |
---|---|
ValueError | If input_source is invalid |
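The path form of input_source (a single document or a folder filtered by docs_extension) can be pictured with the sketch below. `collect_documents` is a hypothetical helper written for illustration. It is not the library's implementation, it omits the pyarrow Table branch, and the exact conditions under which `generate` raises ValueError are assumptions here.

```python
from __future__ import annotations

from pathlib import Path


def collect_documents(input_source: str, docs_extension: str = "docx") -> list[Path]:
    """Hypothetical helper: list the document files that a path input refers to."""
    path = Path(input_source)
    # Normalise the extension so both "docx" and ".docx" are accepted.
    suffix = "." + docs_extension.lstrip(".").lower()

    if path.is_file():
        # Assumption: a single file must match the requested extension.
        if path.suffix.lower() != suffix:
            raise ValueError(f"{path} does not have extension {suffix}")
        return [path]

    if path.is_dir():
        # Batch mode: keep only files in the folder with the requested extension.
        return sorted(
            p for p in path.iterdir()
            if p.is_file() and p.suffix.lower() == suffix
        )

    # Neither a file nor a folder: treated as an invalid input_source.
    raise ValueError(f"input_source is not a file or folder: {input_source}")
```

Validating the path before any model call means an invalid input fails fast, before any LLM requests are made.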