Q&A Generator
The Q&A Generator automatically generates Question & Answer pairs from various document sources. This is useful for creating training data, FAQs, or educational content from existing documents.
This example demonstrates how to use the DocumentQAGeneration class in ydata-sdk
to generate pairs of Questions and Answers.
- Generate Q&A pairs from single documents
- Process multiple documents from a folder
- Work with PyArrow tables containing document data
- Support for multiple document formats (DOCX, TXT)
- Customizable number of Q&A pairs per document
Don't forget to set your license key in the YDATA_LICENSE_KEY environment variable before running the example.
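The example on this page writes the key directly into the script for brevity. As a minimal sketch of an alternative, assuming the key is kept outside the source file, the helper below reuses a value that is already exported or accepts one explicitly. The name `configure_license` is hypothetical and not part of ydata-sdk; only the `YDATA_LICENSE_KEY` variable name comes from the example.

```python
import os

# Environment variable name used by the example on this page.
LICENSE_ENV_VAR = "YDATA_LICENSE_KEY"


def configure_license(key=None):
    """Hypothetical helper (not part of ydata-sdk): ensure the key is set.

    If `key` is given it is written to the environment; otherwise an
    already-exported value is reused. Raises if no usable key is found.
    """
    value = key if key is not None else os.environ.get(LICENSE_ENV_VAR, "")
    value = value.strip()
    # Reject an empty value or the unreplaced placeholder from the example.
    if not value or value == "{add-your-key}":
        raise RuntimeError(f"Set {LICENSE_ENV_VAR} before using ydata-sdk.")
    os.environ[LICENSE_ENV_VAR] = value
    return value
```

Calling `configure_license()` at the top of a script fails early with a clear message when the key is missing, instead of surfacing the problem later during generation.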
Example Code
"""
Document Q&A Generation Example
"""
import os
import pyarrow as pa
from ydata.synthesizers.text.model.qa import DocumentQAGeneration
if __name__ == "__main__":
    # Authenticate to ydata-sdk
    os.environ['YDATA_LICENSE_KEY'] = '{add-your-key}'  # Replace with your license key

    # Step 1: Initialize the Q&A generator
    # You can use either OpenAI or Anthropic as the provider
    print("Initializing Q&A Generator...")
    qa_generator = DocumentQAGeneration()

    # Step 2: Generate Q&A pairs from a single document
    print("\n=== Processing Single Document ===")
    single_doc_result = qa_generator.generate(
        input_source="path/to/your/documents/folder/doc.docx",  # Replace with your document path
        docs_extension="docx",  # Supported formats: "docx" or "txt"
        num_qa_pairs=10,  # Number of Q&A pairs to generate
    )
    print("Single document Q&A pairs:")
    print(single_doc_result)

    # Step 3: Generate Q&A pairs from multiple documents in a folder
    print("\n=== Processing Multiple Documents ===")
    folder_result = qa_generator.generate(
        input_source="path/to/your/documents/folder/",  # Replace with your folder path
        docs_extension="docx",  # Process all documents with this extension
        num_qa_pairs=20,  # Number of Q&A pairs per document
    )
    print("Multiple documents Q&A pairs:")
    print(folder_result)

    # Step 4: Generate Q&A pairs from a PyArrow table
    print("\n=== Processing PyArrow Table ===")
    # Create a sample table with document content
    documents_table = pa.table({
        "text": [
            "This is a sample document about machine learning. It discusses various algorithms and their applications.",
            "Another document about data science and its importance in modern business.",
        ],
        "metadata": [
            {"source": "doc1", "author": "John Doe"},
            {"source": "doc2", "author": "Jane Smith"},
        ],
    })
    table_result = qa_generator.generate(
        input_source=documents_table,
        num_qa_pairs=3,  # Number of Q&A pairs per document
    )
    print("PyArrow table Q&A pairs:")
    print(table_result)
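Step 4 builds its table from inline strings. As a rough sketch of how the same two columns could be filled from plain-text files on disk, the helper below reads every file with a given extension into a dictionary of columns. The function name `collect_text_documents` is hypothetical and not part of ydata-sdk, and the `text` and `metadata` column names are simply copied from the sample table; whether the generator requires those exact names is an assumption.

```python
from pathlib import Path


def collect_text_documents(folder, extension="txt"):
    """Hypothetical helper: read every *.<extension> file in `folder`.

    Returns a dict of equal-length columns shaped like the sample table:
    "text" holds the file contents, "metadata" records the file name.
    """
    columns = {"text": [], "metadata": []}
    # Sort so the row order is stable across runs and platforms.
    for path in sorted(Path(folder).glob(f"*.{extension}")):
        if not path.is_file():
            continue
        columns["text"].append(path.read_text(encoding="utf-8"))
        columns["metadata"].append({"source": path.name})
    return columns
```

With pyarrow installed, `pa.table(collect_text_documents("path/to/your/documents/folder/"))` should produce a table with the same layout as `documents_table`, which can then be passed as `input_source`. This only covers plain-text files; DOCX content would need a separate parser before it could be placed in the `text` column.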