
Q&A Generator

The Q&A Generator automatically generates question-and-answer (Q&A) pairs from document sources. This is useful for creating training data, FAQs, or educational content from existing documents. This example demonstrates how to use the DocumentQAGeneration class in ydata-sdk to generate Q&A pairs.

  • Generate Q&A pairs from a single document
  • Process multiple documents from a folder
  • Work with PyArrow tables containing document data
  • Support multiple document formats (DOCX, TXT)
  • Customize the number of Q&A pairs per document

Don't forget to set up your license key:

    import os

    os.environ['YDATA_LICENSE_KEY'] = '{add-your-key}'
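
Because the snippet above uses a placeholder value, it can be handy to check locally that the placeholder was actually replaced before running the example. The following is a minimal sketch of such a check. The helper name license_key_is_configured is hypothetical and not part of ydata-sdk, and the check only inspects the environment variable; it does not validate the key with YData.

```python
import os

# The placeholder string used in this guide; a real key must replace it.
PLACEHOLDER = "{add-your-key}"


def license_key_is_configured(env=None):
    """Return True if YDATA_LICENSE_KEY is set to a non-placeholder value.

    Hypothetical helper for illustration: it performs a local check only
    and does not contact any licensing service.
    """
    env = os.environ if env is None else env
    key = env.get("YDATA_LICENSE_KEY", "").strip()
    return bool(key) and key != PLACEHOLDER
```

Calling license_key_is_configured() with no arguments inspects the current process environment; passing a dictionary makes the check easy to exercise in isolation.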

Example Code

"""
Document Q&A Generation Example
"""
import os

import pyarrow as pa
from ydata.synthesizers.text.model.qa import DocumentQAGeneration

if __name__ == "__main__":
    # Authenticate to ydata-sdk
    os.environ['YDATA_LICENSE_KEY'] = '{add-your-key}'  # Replace with your license key
    # Step 1: Initialize the Q&A generator
    # You can use either OpenAI or Anthropic as the provider
    print("Initializing Q&A Generator...")
    qa_generator = DocumentQAGeneration()

    # Step 2: Generate Q&A pairs from a single document
    print("\n=== Processing Single Document ===")
    single_doc_result = qa_generator.generate(
        input_source="path/to/your/documents/folder/doc.docx",  # Replace with your document path
        docs_extension="docx",  # Supported formats: "docx" or "txt"
        num_qa_pairs=10,  # Number of Q&A pairs to generate
    )
    print("Single document Q&A pairs:")
    print(single_doc_result)

    # Step 3: Generate Q&A pairs from multiple documents in a folder
    print("\n=== Processing Multiple Documents ===")
    folder_result = qa_generator.generate(
        input_source="path/to/your/documents/folder/",  # Replace with your folder path
        docs_extension="docx",  # Process all documents with this extension
        num_qa_pairs=20,  # Number of Q&A pairs per document
    )
    print("Multiple documents Q&A pairs:")
    print(folder_result)

    # Step 4: Generate Q&A pairs from a PyArrow table
    print("\n=== Processing PyArrow Table ===")
    # Create a sample table with document content
    documents_table = pa.table({
        "text": [
            "This is a sample document about machine learning. It discusses various algorithms and their applications.",
            "Another document about data science and its importance in modern business."
        ],
        "metadata": [
            {"source": "doc1", "author": "John Doe"},
            {"source": "doc2", "author": "Jane Smith"}
        ]
    })

    table_result = qa_generator.generate(
        input_source=documents_table,
        num_qa_pairs=3,  # Number of Q&A pairs per document
    )
    print("PyArrow table Q&A pairs:")
    print(table_result)
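
Step 4 builds its table from hard-coded strings. As a sketch of how the same column layout could be filled from plain-text files on disk, the helper below reads every file with a given extension into "text" and "metadata" lists. The function name collect_text_columns is hypothetical, the column names simply mirror the sample table above (whether the generator requires these exact names is an assumption), and only plain-text files are handled because DOCX files would need a separate parser.

```python
from pathlib import Path


def collect_text_columns(folder, extension="txt"):
    """Read every *.<extension> file in `folder` into column lists.

    Hypothetical helper for illustration. Files are read as UTF-8 and
    returned in sorted path order so the output is deterministic.
    """
    paths = sorted(Path(folder).glob(f"*.{extension}"))
    return {
        # One entry per file: the full file content as a string.
        "text": [p.read_text(encoding="utf-8") for p in paths],
        # Minimal metadata: only the file name each text came from.
        "metadata": [{"source": p.name} for p in paths],
    }
```

The returned dictionary can be passed to pa.table(...) in the same way as the literal dictionary in Step 4, for example pa.table(collect_text_columns("path/to/your/documents/folder/")).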